Disaster Recovery And Backup In A Virtual World: Key Takeaways For I&O Professionals From Interop Las Vegas 2010

During Interop, I attended two sessions on disaster recovery and backup in the virtual world, topics that are near and dear to my heart and also top of mind for infrastructure and operations professionals (judging by the number of inquiries we get on those topics). First up was “How Virtualization Can Enable and Improve Disaster Recovery for Any Sized Business,” which was very interesting (and very well attended). The panel was moderated by Barb Goldworm, President and Chief Analyst, FOCUS, and the panelists were: George Pradel, Director of Strategic Alliances, Vizioncore; Joel McKelvey, Technical Alliance Manager, NetApp; Lynn Shourds, Senior Manager, Virtualization Solutions, Double-Take Software; and Azmir Mohamed, Sr. Product Manager, Business Continuity Solutions, VMware.

Barb kicked off the session with some statistics on disaster recovery that can help people build the business case for it: 40% of businesses that were shut down for 3 days failed within 3 years. She also cautioned that you have to test DR regularly and under unexpected circumstances.

  • Barb: What are the most common mistakes you see customers making when trying to set up disaster recovery plans?
    • George: The first step is to understand RTO (recovery time objective) and RPO (recovery point objective). He described RTO as looking ahead through the windshield (how long will it take me to recover?) and RPO as looking in the rearview mirror (once I have recovered, how much data will I have lost?). I really like this analogy. One of the most common mistakes he sees is that when people tier their servers, they forget about the interdependencies. One way to combat this is to look at things holistically, at the service level, rather than as individual technologies. The other major mistake is not getting executive buy-in.
    • Joel: The most common mistake Joel sees is DR being thought of only after the purchase. A lot of the time, customers buy technology based on CapEx and devalue technology based on OpEx. And they almost never consider the costs of DR for a given technology. (I’m thinking this may warrant a new term: DRex!)
    • Lynn: A common mistake that he sees is infrastructure and operations professionals not setting realistic expectations with the business units about what they can and cannot achieve in terms of RPOs and RTOs. Another common mistake is not differentiating between DR and high availability (HA): DR is resiliency across sites; HA is resiliency within a single site.
    • Azmir: The most common mistake is not using VMware (haha, just kidding). Make sure you do your homework: before you start, understand what you want to achieve with RPO and RTO and whether you will implement your DR in-house or outsource it. The other major problem is that people don’t rehearse or test. When the day comes, it doesn’t matter how much you spent; it will all go down the toilet if you haven’t tested it.
  • Barb: What is the future of DR and virtualization?
    • George: Backup and replication tools are becoming more and more integrated (both Vizioncore and CA have announced a convergence of their backup and replication tools).
    • Joel: Tape backup for operational recovery is going away. “Tape is the roach motel of data: data goes in, it doesn’t come out.”
    • Azmir: DR is no longer thought of as being event-driven. If you think of it as event-driven, it becomes a very expensive insurance policy. Enterprises should be making DR part of their everyday processes and using DR technologies on a day-to-day basis, for example, for maintenance.
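
George's warning about tiering servers without accounting for interdependencies can be illustrated with a small sketch. The Python below is only an illustration: the service names, RTO targets, and dependency map are hypothetical placeholders, not anything presented by the panel. It assumes a service cannot realistically be back before the services it depends on, so it flags any service whose stated RTO is tighter than its dependencies allow.

```python
# Hypothetical inventory: RTO targets in hours, and what each service needs.
# None of these names or numbers come from the session.
rto_hours = {
    "order-entry": 4,      # treated as tier 1
    "database": 4,         # treated as tier 1
    "auth-directory": 24,  # treated as tier 2
    "file-share": 72,      # treated as tier 3
}
depends_on = {
    "order-entry": ["database", "auth-directory"],
    "database": [],
    "auth-directory": [],
    "file-share": ["auth-directory"],
}


def effective_rto(service, seen=None):
    """Return the largest RTO among a service and everything it depends on."""
    seen = set() if seen is None else seen
    if service in seen:  # guard against circular dependencies
        return rto_hours[service]
    seen.add(service)
    worst = rto_hours[service]
    for dep in depends_on.get(service, []):
        worst = max(worst, effective_rto(dep, seen))
    return worst


def tiering_conflicts():
    """List (service, stated RTO, implied RTO) where the stated RTO is too tight."""
    return [
        (name, target, effective_rto(name))
        for name, target in rto_hours.items()
        if effective_rto(name) > target
    ]


for name, target, implied in tiering_conflicts():
    print(f"{name}: target {target}h, but dependencies imply {implied}h")
```

In this made-up inventory, the order-entry service is the only conflict: it promises a 4-hour recovery but relies on a directory service that is only promised back within 24 hours.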

On day two, George Pradel gave another session, “Backing up Your Virtual Environment -- Best Practices.” Backup challenges are the biggest obstacle to virtualization adoption. Backup needs to be rethought in a virtual environment, yet it is one of the processes that no one wants to change. Backup windows are challenging; some people have only a four-hour backup window.
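
To see why a four-hour window can be tight, a back-of-the-envelope calculation helps. The sketch below is a rough estimate under stated assumptions: the data size, sustained throughput, and daily change rate are hypothetical numbers chosen for illustration, and real backups also depend on factors such as deduplication, compression, and contention that are ignored here.

```python
def backup_hours(data_gb, throughput_mb_per_s):
    """Hours needed to move data_gb at a sustained throughput (1 GB = 1024 MB)."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def fits_window(data_gb, throughput_mb_per_s, window_hours=4):
    """True if the transfer completes within the backup window."""
    return backup_hours(data_gb, throughput_mb_per_s) <= window_hours


# Hypothetical environment: 2 TB of VM images, 100 MB/s sustained throughput.
total_gb = 2048
throughput = 100
changed_gb = total_gb * 0.05  # assumed 5% daily change rate for incrementals

print(f"full backup: {backup_hours(total_gb, throughput):.1f} h, "
      f"fits 4 h window: {fits_window(total_gb, throughput)}")
print(f"incremental: {backup_hours(changed_gb, throughput):.1f} h, "
      f"fits 4 h window: {fits_window(changed_gb, throughput)}")
```

With these assumed numbers, the full backup overruns a four-hour window while the incremental fits comfortably, which is one reason schedules often pair an occasional full backup with daily incrementals.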

One way to back up your virtual environment is to use image-based backups, which means that the entire system is encapsulated and can be recovered elsewhere (which obviously can help with DR as well). These kinds of backups can be done direct-to-target or through a proxy server. At the end of the deck, George had some best practices for backing up virtual machines in small, medium, and large environments:

  • Small environments
    • Implement image-based backups to offset traditional costs
    • Static servers
      • Weekly or bi-weekly full image backups
    • Dynamic servers
      • Weekly or bi-weekly full image backups
      • Incremental or differential daily
    • Line of business servers
      • Weekly or bi-weekly full image backups
      • Incremental or differential daily
      • Software replication on or off-site
    • Scrape long-term storage to tape regularly
    • Investigate cloud storage for long-term storage
  • Midsize environments
    • Consider recovery SLA requirements per workload/application
    • Displace traditional agents to save costs
    • “P2V” Disaster Recovery
    • Static servers
      • Weekly full image backup
    • Dynamic servers
      • Weekly or bi-weekly full image backups
      • Incremental or differential daily
    • Line of business servers
      • Weekly or bi-weekly full image backups
      • Incremental or differential daily
      • Storage array-based snapshots and replication on and off-site
    • Secondary storage costs become a concern for long-term data storage
  • Large environments
    • Consider recovery SLA requirements per application
    • Dynamic mix of technologies to meet defined SLAs
    • Infrastructure server/Tier 3 applications
      • Weekly or bi-weekly full image backups
    • Tier 2 applications
      • Monthly/bi-weekly image backups
      • Incremental or differential daily
      • Regular storage snapshots
      • Software replication on or off-site
    • Tier 1 applications
      • Monthly/bi-weekly image backups
      • Incremental or differential daily
      • Regular storage snapshots
      • Storage-array replication on or off-site
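
The recurring pattern in these lists (a periodic full image backup, plus daily incrementals or differentials for servers that change) can be expressed as a small policy table. The following Python is a minimal sketch, not George's method or any vendor's scheduler: the class names, the `POLICIES` table, and the `job_for` helper are hypothetical, and the 7-day full-backup interval is an assumption that can be adjusted to whatever interval a given environment uses.

```python
from datetime import date, timedelta

# Hypothetical policy table loosely modeled on the small-environment list.
POLICIES = {
    "static": {"full_every_days": 7, "daily_incremental": False},
    "dynamic": {"full_every_days": 7, "daily_incremental": True},
    "line_of_business": {"full_every_days": 7, "daily_incremental": True},
}


def job_for(server_class, today, first_full):
    """Return 'full', 'incremental', or None for the given day."""
    policy = POLICIES[server_class]
    days_since = (today - first_full).days
    if days_since < 0:
        return None  # schedule has not started yet
    if days_since % policy["full_every_days"] == 0:
        return "full"
    return "incremental" if policy["daily_incremental"] else None


start = date(2010, 5, 2)  # arbitrary date of the first full backup
for offset in range(8):
    day = start + timedelta(days=offset)
    print(day.isoformat(), job_for("static", day, start), job_for("dynamic", day, start))
```

Replication and storage snapshots, which the lists add for line-of-business and higher-tier applications, run continuously or on their own cadence and are deliberately left out of this sketch.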