During the past 12 to 18 months, there have been a number of notable natural catastrophes and weather-related events. Devastating earthquakes hit Haiti, Chile, China, New Zealand, and Japan. Monsoon floods killed thousands in Pakistan, and a series of floods forced the evacuation of thousands from Queensland. And of course, there was the completely unexpected: ash from the erupting Eyjafjallajökull volcano in Iceland forced the shutdown of much of Western Europe’s airspace. These high-profile events, together with greater awareness and increased regulation, have renewed interest in improving business continuity and disaster recovery preparedness. Last quarter, I published a report on this trend: Business Continuity And Disaster Recovery Are Top IT Priorities For 2010 And 2011.
Each year for the past three years, I've analyzed and written about the state of enterprise disaster recovery (DR) preparedness, and I've seen a definite improvement in overall DR preparedness during that time. Most enterprises now have some kind of recovery data center; enterprises increasingly use an internal or colocated recovery data center to support advanced DR solutions such as replication and more "active-active" data center configurations; and the distance between data centers is increasing. As much as things have improved, there is still a lot of room for improvement, not just in advanced technology adoption but also in DR process management. I typically find that very few enterprises are both technically sophisticated and good at managing DR as an ongoing process.
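As a rough illustration of why that growing distance matters (my own back-of-envelope figures, not survey data): synchronous replication adds at least one network round trip to every write, and that round trip is ultimately bounded by the speed of light in fiber.

```python
# Rough illustration: how data-center distance constrains synchronous
# replication. Light in fiber covers roughly 200 km per millisecond,
# so a round trip adds about distance_km / 100 ms to every write.
# All figures are back-of-envelope assumptions.

for distance_km in (50, 400, 1600):
    rtt_ms = distance_km / 100.0
    print(f"{distance_km:>5} km between sites -> ~{rtt_ms:>4.1f} ms "
          f"added to every synchronously replicated write")
```

Past a few hundred kilometers, that per-write penalty is why shops putting real distance between data centers typically move to asynchronous replication and accept a nonzero recovery point objective.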
When it comes to DR planning and process management, there are a number of standards, including the British Standard for information and communications technology continuity management (BS 25777), other national standards, and even industry-specific standards. British Standards have a history of evolving into ISO standards, and there has already been widespread acceptance of BS 25777 as well as BS 25999 (its business continuity counterpart). No matter which standard you follow, I don’t think you can go drastically wrong. DR planning best practices have been well defined for years, and there is a lot of commonality across these standards. They will all recommend, in some form:

- Conducting a business impact analysis and risk assessment to identify critical processes and systems.
- Defining recovery objectives (recovery time and recovery point) for those systems.
- Documenting recovery plans and keeping them current.
- Testing the plans regularly.
- Treating continuity as an ongoing management process, not a one-time project.
If you still subscribe to fixed-site recovery services using shared IT infrastructure from the likes of HP, IBM BCRS, or SunGard, among others, you will become a dinosaur within the next one to two years.
These types of shared infrastructure services involve lengthy restores from tape and a recovery time objective (RTO) of 72 hours, at best. Plus, you'll be lucky if you recover at all, because chances are you've had trouble scheduling a test with your service provider and it's been a LONG time since the last one, if indeed you’ve ever tested.
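To see why tape restores stretch RTOs into days, consider a back-of-envelope calculation; every figure below is an illustrative assumption of mine, not a number from any provider:

```python
# Back-of-envelope tape restore time, counting sequential streaming only.
# All figures are illustrative assumptions.

data_tb = 20                # data to restore (TB)
mb_per_s_per_drive = 140    # sustained throughput of one tape drive (MB/s)
drives = 4                  # drives dedicated to this restore

total_mb = data_tb * 1024 * 1024
hours = total_mb / (mb_per_s_per_drive * drives) / 3600
print(f"~{hours:.0f} hours of pure streaming time")  # ~10 hours
```

And that is the optimistic floor: tape mounts, catalog rebuilds, network bottlenecks, and contention for shared drives with other customers multiply it, which is how 72-hour RTOs happen.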
A 72-hour recovery just doesn't cut it anymore. And frankly, digging into your provider's oversubscription ratio for shared infrastructure to determine the risk of multiple invocations, or attempting to negotiate exclusion zones and availability guarantees, is a time suck. Most companies are either taking DR back in-house or, if they still rely on a DR service provider, they are using dedicated infrastructure.
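To make that multiple-invocation risk concrete, here is a toy binomial model (hypothetical figures of mine; it also assumes invocations are independent, which regional disasters violate, so it understates the real exposure):

```python
# Toy model of multiple-invocation risk on shared DR infrastructure.
# Hypothetical figures throughout; real invocations are correlated by
# regional disasters, so this understates the true risk.

from math import comb

def p_capacity_exceeded(n: int, k: int, p: float) -> float:
    """P(more than k of n subscribers invoke at once), binomial model."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# 50 subscribers sharing capacity for 2 simultaneous recoveries,
# each with a 2% chance of invoking in a given window:
print(f"{p_capacity_exceeded(50, 2, 0.02):.0%}")  # ~8%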
TechCrunchIT reported today that a Rackspace data center went down for several hours during the evening due to a power grid failure. Because Rackspace is a managed service provider (MSP), the downtime affected several businesses hosted in the data center.