
DR should never be “typical” — so why are so many companies getting it wrong?
Posted January 9, 2007 by Alan Chhabra
Disaster recovery strategies are not new. In fact, many CIOs and IT managers have DR mandates from executive management. The ever-growing possibility of severe weather, power failures, terrorist attacks, pandemics, or even a fire or HVAC leak in the datacenter is driving this focus. Business Continuity Planning (BCP) initiatives are fairly large efforts, with infrastructure DR being one of the key components. Here at Egenera, we help guide our customers with some requirements for a sound DR strategy:
  1. Consider Total Cost of Ownership — Work within a budget but keep true TCO measures (both capital and operational) at the forefront
  2. Don’t Short Training — With proper training, all staff should be able to follow the plan and ensure success of the solutions without the need to grow the staff
  3. Process, Process, Process — The solution must be able to change as hosted application services and server configurations change. If you add new servers it shouldn't take months or weeks to ensure they can be replicated in case of a disaster
  4. Foolproof Technical Solution — The solution has to work each time, every time, or the business could be at risk. For example, if traders can't trade because primary systems are offline for one hour what's the cost? What will the press say? Can we ever really recover?
The "typical" DR solution is to physically replicate all the servers, switches, routers, NICs, HBAs, cabling, storage arrays, firewalls, load balancers, monitors, backup equipment, KVMs, etc. from the primary datacenter to a backup datacenter. Both datacenters are kept looking exactly the same, but the DR site sits idle or "cold," waiting for a failover from the primary site to occur. Beyond the complexity this introduces, it's a tremendous drain on the IT staff. Did I mention the cost? I would bet against it working 10 out of 10 times. Is 5 out of 10 good enough? Here are a few examples of the pain with typical DR solutions:
  • Imagine replicating 1000+ SAN and network cables between two sites. It's hard enough keeping track of new cables run every night for new servers in the datacenter. Now the enterprise has to make sure they get run twice, in separate locations. Many of these cables plug into switches, that plug into other switches, that plug into firewalls, that plug into routers etc. If a change is made in the primary site, it needs to be made at the DR site...any and every change. You get the point, it's a mess. I am not sure what you call replicating a mess and creating another mess. But let's just say it isn’t pretty...
  • Many servers have local physical disks in them. Many times IT has application data living on those disks. How do you replicate the data that lives on a physical disk to another site miles and miles away? Many datacenters copy the data to local tape that is then shipped off and stored in a third-party secure warehouse. If you have ever tried to restore 100s of GBs of data from tape you understand the pain with this process. Not to mention you have to find the right tapes and get those tapes to your backup site.
  • Idle servers are bad. They sit there and use up valuable real estate. Many times it's easy to forget to update their firmware or patch the OS they run. They have hardware failures but no one knows since they're not actively used or tested. Did I mention they cost money while collecting dust?
  • If you were to buy two servers from one of the big three vendors (IBM, Dell, HP) that were identical models but were ordered one month apart, you may have different hardware drivers. If this is the case, even if you were to move your application from one server to another (same model), your application may not work because of the hardware driver difference.
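To make the "any and every change" problem concrete, here is a minimal sketch of the kind of check a typical DR shop ends up needing: comparing a primary-site inventory against the DR-site inventory and flagging anything that has drifted. The `Device` record, its fields, and the sample device names and version strings are all hypothetical and made up for illustration; a real inventory would come from whatever asset or configuration database you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """One piece of equipment as recorded in a site inventory (hypothetical schema)."""
    name: str
    firmware: str
    os_patch: str


def find_drift(primary, dr):
    """Return human-readable differences between two site inventories."""
    dr_by_name = {d.name: d for d in dr}
    primary_names = {d.name for d in primary}
    problems = []
    for dev in primary:
        twin = dr_by_name.get(dev.name)
        if twin is None:
            # The primary site gained a device that was never replicated.
            problems.append(f"{dev.name}: missing at DR site")
            continue
        if twin.firmware != dev.firmware:
            problems.append(f"{dev.name}: firmware {dev.firmware} vs {twin.firmware}")
        if twin.os_patch != dev.os_patch:
            problems.append(f"{dev.name}: OS patch {dev.os_patch} vs {twin.os_patch}")
    for d in dr:
        if d.name not in primary_names:
            # Leftover equipment that no longer mirrors anything in production.
            problems.append(f"{d.name}: exists only at DR site")
    return problems


# Illustrative data only: three production devices, two of them replicated.
primary_site = [
    Device("web01", "2.1", "2007-01"),
    Device("db01", "3.0", "2007-01"),
    Device("app01", "1.4", "2006-12"),
]
dr_site = [
    Device("web01", "2.1", "2007-01"),
    Device("db01", "2.8", "2006-11"),
]

for line in find_drift(primary_site, dr_site):
    print(line)
```

Even this toy version only covers three attributes per device. Every cable, switch port, firewall rule and driver version you replicate adds another field that somebody has to keep in sync at both sites.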
Bottom line: By going with the typical DR solution you are doubling your physical datacenter complexity and, worst of all, doubling the management complexity without really meeting any of the four requirements above. There is hope though! The "alternative" DR solution is to leverage as much virtualization as possible to remove the physical complexity.
  • Move as much data as possible off local physical disks and onto shared SAN or NAS virtual devices. While the SAN or NAS storage costs may rise slightly, paying for more shared drive space is much better for TCO than the typical hairball I detailed earlier. If all your application data lives on a SAN you can use "cool" technologies like SRDF and other mirroring technologies to replicate between two different SANs in two different datacenters. We just eliminated most, if not all, of the need for tape for DR.
  • Consolidate I/O into shared fabric switches so there is a massive reduction in physical cabling, NICs and HBAs. We just removed the necessity to check and re-check massive cabling changes between both sites.
  • Remove the physical ties between the application, the OS and the hardware (CPU, memory) it uses. Ideally, the hardware should be stateless and maintain state only when an application and OS need it. The hardware goes into a pool of resources that can be used for any application, anytime, just-in-time. We just removed the whole driver problem from the hardware stack...not to mention that an application can run on a 4-way AMD in production and be moved to a 2-way Intel in case of a disaster without any issues.
  • Reduce the number of physical servers you have by leveraging hypervisors. Server virtualization means less physical "stuff" to replicate. One needs to be careful though—enterprises need flawless management software in order to do this without creating even more complexity. Things can get lost if you can't see them on the datacenter floor. Good management software is the key that allows hypervisors to help you big time.
  • Invest in provisioning software that allows stateless hardware to be provisioned and re-provisioned in a matter of minutes rather than weeks or months. For example, in the mainframe world one can provision and re-provision an LPAR. The MIPS are shared between different LPARs, so your MIPS can be reallocated to needy applications when needed. An LPAR of 500 MIPS or so could run a COBOL app one day and DB2 the next. The mainframe solution for DR is very good, BUT the cost of mainframes is the problem.
  • Mainframe-like solutions (w/o the price tag) eliminate the "cold" equipment issue in the DR site. Use that equipment for development, staging, QA, or even alternate production applications. You can do this only if you acquire and develop the tools and processes to re-provision the environment in a matter of minutes or hours if a disaster were to occur.
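The stateless-hardware idea in the bullets above can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's actual API: `Blade`, `Profile` and `fail_over` are hypothetical names, and the placement rule (largest profile first, smallest blade that fits, idle blades preferred) is just one simple policy chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Blade:
    """A stateless compute node: it has no identity until a profile is assigned."""
    blade_id: str
    cpus: int
    profile: Optional[str] = None  # name of the workload currently booted, if any
    production: bool = False


@dataclass(frozen=True)
class Profile:
    """Everything that defines a server, kept off the hardware (hypothetical)."""
    name: str
    min_cpus: int


def fail_over(pool, profiles):
    """Assign each production profile to a DR blade, displacing dev/QA work if needed.

    Returns a mapping of profile name -> blade id. Raises RuntimeError if the
    pool cannot host every profile.
    """
    placement = {}
    # Place the most demanding profiles first so large blades are not wasted.
    for prof in sorted(profiles, key=lambda p: p.min_cpus, reverse=True):
        candidates = [b for b in pool if not b.production and b.cpus >= prof.min_cpus]
        if not candidates:
            raise RuntimeError(f"no blade available for {prof.name}")
        # Prefer the smallest blade that fits; idle blades win ties over busy ones.
        blade = min(candidates, key=lambda b: (b.cpus, b.profile is not None))
        blade.profile = prof.name  # whatever dev/QA work ran here is displaced
        blade.production = True
        placement[prof.name] = blade.blade_id
    return placement


# Illustrative DR pool: two blades doing QA and dev work, one idle.
dr_pool = [
    Blade("dr-1", cpus=2, profile="qa-regression"),
    Blade("dr-2", cpus=4, profile="dev-build"),
    Blade("dr-3", cpus=2),
]
production = [Profile("trading-db", min_cpus=4), Profile("trading-web", min_cpus=2)]

print(fail_over(dr_pool, production))
```

The point of the sketch is that the DR site is never "cold": the blades earn their keep running dev and QA until the day they are needed, and the failover step is a reassignment of profiles rather than a scramble to match hardware.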
Alternative solutions for DR are available that work 10 out of 10 times. Egenera and a few other vendors have solutions that you may want to consider. They can help you meet all four requirements above. I outlined some of the characteristics to look for when you select your solution. If you follow them, your organization will most likely start thinking differently about the hardware it purchases. Remember, it's your business. You could be at risk if you rely on the typical 2x complexity solution. It may appear to be the easiest at first and the least disruptive to your organization's way of thinking, but many have tried and failed in their attempt to establish a reliable DR solution. There are alternatives out there that can help you accomplish your goals.
