Recent outages at Amazon and Google have got me thinking about resiliency in the cloud. When you use a cloud service, whether you are consuming an application (backup, CRM, email, etc.) or just using raw compute or storage, how is your data being protected? A lot of companies assume that the provider is doing regular backups, storing data in geographically redundant locations, or even maintaining a hot site somewhere with a copy of their data. Here's a hint: ASSUME NOTHING. Your cloud provider isn't in charge of your disaster recovery plan, YOU ARE!
Yes, several cloud providers offer a fair amount of built-in resiliency, but not all of them do, so it's important to ask. Even within a single provider, policies can differ by service. Amazon Web Services, for example, has different policies for EC2 (users are responsible for their own failover between zones) and S3 (data is automatically replicated between zones in the same geographic region). Here is a short list of questions I would ask your provider about their resiliency:
Can I audit your business continuity/disaster recovery (BC/DR) plans?
Can I review your BC/DR planning documents?
Geographically, where are your recovery centers located?
In the event of a failure at one site, what happens to my data?
Can you guarantee that my data will not be moved outside of my country/region in the event of a disaster?
What kinds of service-levels can you guarantee during a disaster?
What are my expected/guaranteed recovery time objective (RTO) and recovery point objective (RPO)?
One thing I’ve found in common across infrastructure and operations groups of all shapes and sizes is that they are continually searching for the ideal set of key performance indicators: a set of metrics that perfectly measures their infrastructure and demonstrates the excellence of their operations, yet is still simple and cheap to collect. At least once a week I speak with a client searching for the holy grail of metrics, hopeful that I hold that coveted knowledge. They’re inevitably disappointed to find out that I don’t know what the best set of metrics is, and that I truly think it doesn’t exist! Sorry if I’m bursting your bubble, but there is no essential set of metrics for all infrastructure and operations organizations. What makes sense for one organization to measure may be completely useless for another. What is very simple to collect at one company may be nearly impossible at another.
While I don’t believe in the myth of a single set of perfect metrics for all organizations, I do think it is valuable to learn what other organizations are measuring so you can compare their metrics to your own (and maybe steal some of theirs). That is why I am gathering a list of metrics from infrastructure and operations groups globally to form a database of metrics. Once we have a good number of metrics on this list, I will work to consolidate them down to the most commonly cited ones and collect benchmarks on them. We’re calling this project “Forrester's Consensus Metrics For Infrastructure & Operations” and I really hope you’ll consider contributing to it, because we can’t do this without your input.