Recent outages at Amazon and Google have got me thinking about resiliency in the cloud. When you use a cloud service, whether you are consuming an application (backup, CRM, email, etc.) or just using raw compute or storage, how is that data being protected? A lot of companies assume that the provider is doing regular backups, storing data in geographically redundant locations, or even maintaining a hot site somewhere with a copy of your data. Here's a hint: ASSUME NOTHING. Your cloud provider isn't in charge of your disaster recovery plan, YOU ARE!
Yes, several cloud providers offer a fair amount of built-in resiliency, but not all of them do, so it's important to ask. Even within a single provider, policies differ by service. Amazon Web Services, for example, leaves EC2 users responsible for their own failover between zones, while S3 data is automatically replicated between zones in the same geo. Here is a short list of questions I would ask your provider about their resiliency:
Can I audit your BC/DR plans?
Can I review your BC/DR planning documents?
Geographically, where are your recovery centers located?
In the event of a failure at one site, what happens to my data?
Can you guarantee that my data will not be moved outside of my country/region in the event of a disaster?
What kinds of service-levels can you guarantee during a disaster?
What are my expected/guaranteed recovery time objective (RTO) and recovery point objective (RPO)?
As a follow-up to my blog post yesterday, there’s another area worth noting in the resurgence of interest in BC preparedness, and that’s standards. For a long time, we’ve had a multitude of industry and government standards on business continuity management, including the Australian Standards BCP Guidelines, the Singapore Standard for Business Continuity/Disaster Recovery Service Providers (which became much of the foundation for ISO 24762 IT Disaster Recovery), the FFIEC BCP Handbook, the NIST Contingency Planning Guide, NFPA 1600, BS 25999 (which will become much of the foundation for the soon-to-be-released ISO 22301), ISO 27031, and more. There are also standards in other domains that touch on BC, such as the security standards ISO 27001/27002.
And when it comes down to it, several of the broad risk management standards, like ISO 31000, are applicable as well. At the end of the day, the same risk management disciplines underpin BC, DR, security, and enterprise risk management: you conduct a BIA and a risk assessment; accept, transfer, or mitigate the risk; develop contingency plans; and keep those plans tested and up to date.
In my most recent research into various BCM software vendors and BC consultancies, as well as input from Forrester clients, BS 25999 appears to be the standard with the most interest and adoption. In the US at least, I attribute part of this to the fact that BS 25999 is now one of the recognized standards for the US Department of Homeland Security’s Voluntary Private Sector Preparedness Accreditation and Certification Program. The other recognized standards are NFPA 1600 and ASIS SPC.1-2009; I’ve heard very few Forrester clients mention the latter as their standard.
During the last 12 to 18 months, there have been a number of notable natural catastrophes and weather-related events. Devastating earthquakes hit Haiti, Chile, China, New Zealand, and Japan. Monsoon floods killed thousands in Pakistan, and a series of floods forced the evacuation of thousands from Queensland. And then there was the completely unusual: ash from the erupting Eyjafjallajökull volcano in Iceland forced the shutdown of much of Western Europe’s airspace. These high-profile events, together with greater awareness and increased regulation, have renewed interest in improving business continuity and disaster recovery preparedness. Last quarter, I published a report on this trend: Business Continuity And Disaster Recovery Are Top IT Priorities For 2010 And 2011.
“Are you on the business side or the IT side?” was a question I received maybe a half dozen times last month while I was attending the Disaster Recovery Journal Fall World in San Diego. This question really got me thinking—everyone at the conference worked in business continuity (BC) and/or disaster recovery (DR), but there was a definite divide between those who reported into IT departments and those who reported into the business. For the most part, the divide fell along these lines: those who reported into IT had a DR focus, and those who reported into the business (or perhaps into security and risk) had a BC focus. Attending breakout sessions across both domains, I noted the good news: both groups speak the same language: RTO, RPO, availability, downtime, resilience, etc. The bad news is that I’m not sure we’re all using the same dictionary.
Two of the business-focused sessions I attended pointed out a troubling difference in the way IT and the business interpret one of the simplest of BC/DR terms: RTO. What is RTO? Simply put, it is the time to recover a service after an outage. This seems straightforward enough, but let’s break down how a business and an IT professional might understand RTO:
Business: The maximum amount of time that my service can be unavailable.
IT: The amount of time it takes to recover that service.
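To make the gap between those two readings concrete, here’s a minimal sketch with an invented outage timeline (every timestamp below is hypothetical): IT often starts its RTO clock when a disaster is declared, while the business experiences the entire outage from the moment the service went down.

```python
from datetime import datetime

# Hypothetical timeline; all timestamps are invented for illustration.
outage_start      = datetime(2011, 5, 2, 9, 0)   # service actually goes down
disaster_declared = datetime(2011, 5, 2, 10, 0)  # outage detected, DR invoked
service_restored  = datetime(2011, 5, 2, 13, 0)  # service back online

# IT's reading: time to recover the service, measured from declaration.
it_recovery_time = service_restored - disaster_declared

# The business's reading: total time the service was unavailable.
business_downtime = service_restored - outage_start

print(it_recovery_time)   # 3:00:00 -- IT "meets" a 4-hour RTO with an hour to spare
print(business_downtime)  # 4:00:00 -- the business saw a full four hours of downtime
```

In this toy timeline, IT comfortably beats a 4-hour recovery target while the business sits right at its 4-hour unavailability ceiling; add slower detection or escalation and the business target is blown even as IT reports success.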
Over the past several months, I've been receiving a lot of questions about replication for continuity and recovery. One thing I've noticed, however, is that there is a lot of confusion around replication and its uses. To combat this, my colleague Stephanie Balaouras and I recently put out a research report called "The Past, Present, And Future Of Replication" where we outlined the different types of replication and their use cases. In addition to that, I thought it would be good to get some of the misconceptions about replication cleared up:
Myth: Replication is the same as high availability.
Reality: Replication can help enable high availability and disaster recovery, but it is not a solution in and of itself. In the case of an outage, simply having another copy of the data at an alternate site isn't going to help if you don't have a failover strategy or solution. Some host-based replication products do come with integrated failover and failback capabilities.
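As a toy illustration of why a second copy alone isn't high availability, consider this sketch (the classes and names are invented for illustration, not any vendor's API): replication only copies the data; something else has to detect the outage and redirect requests.

```python
class Node:
    """A storage node holding one copy of the data (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True

def replicate(primary, replica):
    # Replication copies data to the alternate site -- and that's ALL it does.
    replica.data = dict(primary.data)

def read(key, primary, replica):
    # This failover check is the part replication alone doesn't give you:
    # detecting the outage and redirecting the request to the surviving copy.
    node = primary if primary.healthy else replica
    return node.data.get(key)

primary, replica = Node("site-a"), Node("site-b")
primary.data["order-42"] = "shipped"
replicate(primary, replica)

primary.healthy = False  # simulate an outage at the primary site
print(read("order-42", primary, replica))  # shipped -- thanks to the failover logic
```

Delete the health check in `read` and the second copy sits at the alternate site doing nothing while requests keep failing against the dead primary, which is exactly the gap a failover strategy fills.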
Myth: Replication is too expensive.
Reality: It's true that, traditionally, array-based replication has been expensive because it requires like-to-like storage and additional licensing fees. However, two factors have mitigated this expense: 1) several storage vendors no longer charge an extra licensing fee for replication; and 2) there are several alternatives to array-based replication that let you use heterogeneous storage and come at a significantly lower acquisition cost. Replication products fall into one of four categories (roughly from most to least expensive):
During Interop, I attended two sessions on disaster recovery and backup in the virtual world, topics that are near and dear to my heart and also top of mind for infrastructure and operations professionals (judging by the number of inquiries we get on those topics). First up was “How Virtualization Can Enable and Improve Disaster Recovery for Any Sized Business,” which was very interesting (and very well attended). The panel was moderated by Barb Goldworm, President and Chief Analyst, FOCUS, and the panelists were: George Pradel, Director of Strategic Alliances, Vizioncore; Joel McKelvey, Technical Alliance Manager, NetApp; Lynn Shourds, Senior Manager, Virtualization Solutions, Double-Take Software; and Azmir Mohamed, Sr. Product Manager, Business Continuity Solutions, VMware.
Barb kicked off the session with a statistic on disaster recovery that can help people build the business case for it: 40% of businesses that were shut down for three days failed within three years. She also cautioned that you have to test DR regularly and under unexpected circumstances.
Each year for the past three years, I've analyzed and written on the state of enterprise disaster recovery preparedness, and I've seen a definite improvement in overall DR preparedness during that time. Most enterprises now have some kind of recovery data center; enterprises often use an internal or colocated recovery data center to support advanced DR solutions such as replication and more "active-active" data center configurations; and the distance between data centers is increasing. As much as things have improved, there is still a lot of room for improvement, not just in advanced technology adoption but also in DR process management. I typically find that very few enterprises are both technically sophisticated and good at managing DR as an ongoing process.
When it comes to DR planning and process management, there are a number of standards, including the British Standard for IT Service Continuity Management (BS 25777), other country-specific standards, and even industry-specific standards. British Standards have a history of evolving into ISO standards, and there has already been widespread acceptance of BS 25777 as well as BS 25999 (the business continuity counterpart). No matter which standard you follow, I don’t think you can go drastically wrong: DR planning best practices have been well defined for years, and there is a lot of commonality among these standards. They will all recommend:
If you still subscribe to fixed site recovery services using shared IT infrastructure from the likes of HP, IBM BCRS, or SunGard, among others, you will quickly become a dinosaur in the next 1 to 2 years.
These types of shared infrastructure services involve lengthy restores from tape and a recovery time objective of 72 hours, at best. Plus, you'll be lucky if you recover at all because chances are, you've had trouble scheduling a test with your service provider and it's been a LONG time since the last one, if indeed you’ve ever tested.
A 72-hour recovery just doesn't cut it anymore. And frankly, understanding your provider's oversubscription ratio for shared infrastructure to determine the risk of multiple invocations, or attempting to negotiate exclusion zones and availability guarantees, is a time suck. Most companies are either taking DR back in-house or, if they still rely on a DR service provider, they are using dedicated infrastructure.
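To see what that oversubscription analysis actually involves, here's a hypothetical back-of-the-envelope model (all numbers invented for illustration): treat each customer's DR invocation as an independent coin flip and ask how likely it is that more customers declare a disaster at once than the shared site can host.

```python
from math import comb

def p_excess_invocations(customers, capacity, p_invoke):
    """Probability that more than `capacity` of `customers` invoke DR at the
    same time, assuming each invokes independently with probability p_invoke.
    (A regional disaster breaks the independence assumption -- which is the point.)"""
    return sum(
        comb(customers, k) * p_invoke**k * (1 - p_invoke)**(customers - k)
        for k in range(capacity + 1, customers + 1)
    )

# Say 20 customers share infrastructure sized for 4 simultaneous recoveries.
print(round(p_excess_invocations(20, 4, 0.02), 4))  # quiet times: negligible risk
print(round(p_excess_invocations(20, 4, 0.20), 4))  # regional event: roughly 0.37
```

Even this toy model shows the catch: the independence assumption collapses in exactly the scenario you bought the service for, a regional disaster that makes every nearby customer invoke at once, and that correlated risk is what dedicated infrastructure sidesteps.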
TechCrunchIT reported today that a Rackspace data center went down for several hours during the evening due to a power grid failure. Because Rackspace is a managed service provider (MSP), the downtime affected several businesses hosted in the data center.