Why Recovery Time Objectives (RTOs) Can Be Misleading

“Are you on the business side or the IT side?” was a question I received maybe a half dozen times last month while attending Disaster Recovery Journal Fall World in San Diego. The question really got me thinking: everyone at the conference worked in business continuity (BC) and/or disaster recovery (DR), but there was a definite divide between those who reported into IT departments and those who reported into the business. For the most part, those who reported into IT had a DR focus, and those who reported into the business (or perhaps into security and risk) had a BC focus. Attending breakout sessions across both domains, I noted the good news: both groups speak the same language (RTO, RPO, availability, downtime, resilience, etc.). The bad news is that I’m not sure we’re all using the same dictionary.

Two of the business-focused sessions I attended pointed out a troubling difference in the way IT and the business interpret one of the simplest BC/DR terms: RTO. What is RTO? Simply put, it is the time to recover a service after an outage. This seems straightforward enough, but let’s break out how a business professional and an IT professional might understand RTO:

  • Business: The maximum amount of time that my service can be unavailable.
  • IT: The amount of time it takes to recover that service.

The difference between these two interpretations is that IT does not normally account for the time it takes to declare a disaster. As soon as a service goes down, do you immediately declare a disaster? No, you would most likely spend some time troubleshooting before deciding that a disaster needs to be declared. Even then, IT most often does not have the power to declare the disaster; the decision may be up to an executive, which adds more time before a declaration occurs and IT can actually start the work of recovery.

The enlightened track presenters both suggested that BC/DR professionals separate time to declaration from recovery time, to avoid miscommunication, missed SLAs, and bad blood between IT and the business. I think this is an interesting idea, so I’m curious: are any of you out there measuring RTO in a segmented way like that?

PS, SHAMELESS PLUG FOR MY OWN RESEARCH: We are currently running a survey on disaster recovery with the DRJ. I encourage you to take our survey for a chance to win a free pass to DRJ Spring World in Orlando, where I'll be presenting the findings!

Comments

RTO is an objective time

The misleading concept you refer to is actually not RTO, but business continuity versus IT continuity.

RTO is an objective time, unrelated to when the disaster is declared. It can be traced from log files or recorded automatically by a monitoring system, if one is installed. If there is an argument over an SLA, IT should show the business this objective data.

RTO should be agreed by both the business and IT sides: the shorter the RTO, the more money it costs!

Regards,

Thomas

RTO

I would say that your first term, "The maximum amount of time that my service can be unavailable." is not RTO, but rather MTPD, the "Maximum Tolerable Period of Disruption". MTPD is a Business Continuity measure, RTO is a Disaster Recovery measure.

MTPD comes out of the analyses for BCM and is one of the input values used to define RTO in the IT DR plan. It will always be > RTO for exactly the reasons you mention.

Steve

Rachel I am a Brazilian

Rachel

I am a Brazilian business continuity management consultant, and I'm glad someone raised this discussion, because this distinction really does exist, and I once ran into it in practice.

First, I want to stress that the time needed to trigger a continuity and/or recovery response MUST be counted in the RTO, otherwise the RTO will be flawed! But I believe that this activation time can be identified as the maturity of the BCM process increases.

Another point I advocate is that the business is the one that defines the RTO for IT or for any other required resource, such as people, facilities, suppliers...

This is where your observation comes in: if we want to identify the RTO of a given process, we must ask the following question: How much downtime is acceptable?

The answer gives us the RTO of the process, which should be passed on to its resources, but adjusted for the time needed to trigger the recovery of those resources. That time can be identified with the responsible technicians by asking the following question: How long does it take you to recover this resource if a given incident occurs?

I hope everyone has understood, and I apologize for my poor English. :)

While I'm here, I would like to ask for some help: I am starting my final course project and am thinking of focusing it on dynamic RTO. Has anyone seen an article or any literature that addresses this subject?

Thank you all for your attention!

RTO

Rachel -

You pose a great question. Understanding the components of RTO during an outage can help focus a company on ways to reduce each of these components. Specifically they are:

1. Detection of the failure/outage
2. Notification of IT of the outage
3. Development of an action plan by IT
4. Time to convene recovery team
5. Time to recover the sustaining systems and infrastructure
6. Time to recover the primary system
7. Time to notify end users of the recovered system
8. Time for end users to reconnect to the recovered system

Together these components represent the downtime associated with a system outage. Each component can be separately addressed, however. For example, monitoring/alerting systems can reduce the time to detect a failure (1) - alerting a tech at 3 a.m. of the failure instead of receiving a call at 9 a.m. as users try to access the system. In addition, documented, system-specific action plans for critical systems can help to reduce the time needed to assign responsibilities for recovery operations (3).
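
To make this breakdown concrete, here is a minimal Python sketch that adds up the eight components and reports the largest one. The durations are made-up placeholders, not measurements, and the function names are only illustrative; substitute figures from your own incident records.

```python
from datetime import timedelta

# Hypothetical durations for the eight downtime components listed above.
# Every value here is an illustrative assumption, not real incident data.
COMPONENTS = [
    ("1. detect the failure/outage", timedelta(minutes=30)),
    ("2. notify IT of the outage", timedelta(minutes=10)),
    ("3. develop an action plan", timedelta(minutes=20)),
    ("4. convene the recovery team", timedelta(minutes=30)),
    ("5. recover sustaining systems and infrastructure", timedelta(minutes=120)),
    ("6. recover the primary system", timedelta(minutes=60)),
    ("7. notify end users", timedelta(minutes=10)),
    ("8. end users reconnect", timedelta(minutes=20)),
]


def total_downtime(components):
    """Sum the component durations into one end-to-end downtime figure."""
    return sum((duration for _, duration in components), timedelta())


def largest_component(components):
    """Return the (name, duration) pair that contributes the most downtime."""
    return max(components, key=lambda item: item[1])


if __name__ == "__main__":
    print("Total downtime:", total_downtime(COMPONENTS))
    name, duration = largest_component(COMPONENTS)
    print("Largest component:", name, duration)
```

With these placeholder numbers the total comes to five hours, and step 5 dominates. Looking at the components side by side like this is one way to decide where monitoring or documented action plans would pay off most.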

I really enjoyed your post.

Thanks,
-Erik

RTO and MTPD

Hi there,
My name is Chris and I'm a Product Manager with a certification body. Over here, we do certification for SS 540:2008, which is a Singapore standard for Business Continuity Management (BCM).

In my humble opinion, the Maximum Tolerable Period of Disruption (MTPD) is the timeframe beyond which the business is going to suffer tremendously if its business services have not been recovered.

On the other hand, Recovery Time Objective (RTO) is defined in SS 540 as the period of time within which systems, applications or functions must be recovered after a disruption has occurred. Note that RTO is measured from the time a disaster occurs (not when it is declared) to the time the business is recovered to a level acceptable to the organization.

Essentially, RTO should be less than or equal to MTPD. It is fine if an organization wants to have a stringent RTO, but it does not make business sense if RTO > MTPD.

What do you guys think?

Thanks!

Regards,
Chris

Maximum Tolerable Period of Disruption (MTPD)

Chris, Steve, you both raise a good point that what I describe above as RTO in business terms is really what you would term MTPD. I do agree with you, but I think the larger problem is the widespread misconception (especially in IT departments) that RTO = MTPD. How do we bridge this gap?

Chris, I think it's interesting that you see RTO as the time from when the disaster occurs... I would assume that in many cases once an event occurs, depending on its magnitude, a decision must be made whether to repair the system (if possible) or to invoke the DR plan. If this decision isn't made immediately, that will elongate the recovery time and possibly mean that the MTPD is missed.

So I guess my suggestion is an equation of sorts:

MTPD = RTO (the amount of time it takes to restore a system once work commences until that system is fully operational) + DDP (Disaster declaration period, the amount of time that it takes for a disaster to be declared and recovery work to commence)

The MTPD is defined by the business and the RTO and DDP are SLAs that are entered into by IT and the business (or whoever is in charge of declaring a disaster)
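
Here is a minimal sketch of that equation in Python, assuming all three quantities are expressed as durations. The function names and the sample figures are hypothetical, chosen only to show the arithmetic.

```python
from datetime import timedelta


def fits_within_mtpd(mtpd, rto, ddp):
    """True if declaration time plus recovery time stays within the MTPD."""
    return ddp + rto <= mtpd


def declaration_budget(mtpd, rto):
    """Time left for declaring the disaster once the RTO is fixed (MTPD - RTO)."""
    return mtpd - rto


if __name__ == "__main__":
    mtpd = timedelta(hours=8)  # set by the business (assumed value)
    rto = timedelta(hours=6)   # recovery work, start to finish (assumed value)
    ddp = timedelta(hours=3)   # time taken to declare the disaster (assumed value)
    print("Fits within MTPD:", fits_within_mtpd(mtpd, rto, ddp))
    print("Declaration budget:", declaration_budget(mtpd, rto))
```

In this example IT could hit its six-hour RTO and the business would still miss its eight-hour MTPD, because the declaration took three hours when only two were available.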

Thoughts?

Hi Rachel, This is the sort

Hi Rachel,

This is the sort of discussion that would be easier in front of a whiteboard :)

I think that we're agreed on MTPD, but I would agree with Chris that RTO is counted from the moment of disaster. It's a measure of the maximum allowed _system_ outage time, starting at the instant those systems went down. Once the systems are restored, though, the recovery process for the service/business is still not necessarily complete. System users may have to log in, re-establish sessions, and otherwise restart the business processes that depend on having the systems available. That could take anything from a few minutes to several hours. That's the gap between MTPD and RTO.

If you start from MTPD as the driving factor, it should be possible to calculate that business recovery period (is there a formal term for that? BRP?) based on knowledge of the business. (MTPD - BRP) then gives the RTO that must be met, and I would consider that derived RTO as including both the decision period (repair/restart, or switch to DR backup) and the time to execute that decision.

I suppose the difference here is that I count the disaster discovery period (the time it takes for someone to notice, and hit that big red emergency button) and the disaster declaration process, as being part of the recovery process.
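
A rough Python sketch of this way of counting, assuming you have timestamps for the outage, the system restoration, and the point where the business is working again. The function names, the "BRP" label, and the sample times are illustrative only.

```python
from datetime import datetime, timedelta


def derived_rto(mtpd, brp):
    """RTO the systems must meet: MTPD minus the business recovery period."""
    return mtpd - brp


def split_outage(outage_start, systems_restored, business_resumed):
    """Split one incident into system outage time and business recovery time.

    System outage time runs from the instant the systems went down, so it
    includes discovery, declaration, and the recovery work itself.
    """
    system_outage = systems_restored - outage_start
    business_recovery = business_resumed - systems_restored
    return system_outage, business_recovery


if __name__ == "__main__":
    # All figures below are made-up examples.
    rto = derived_rto(mtpd=timedelta(hours=8), brp=timedelta(hours=2))
    system_outage, business_recovery = split_outage(
        outage_start=datetime(2011, 1, 10, 3, 0),
        systems_restored=datetime(2011, 1, 10, 8, 30),
        business_resumed=datetime(2011, 1, 10, 10, 0),
    )
    print("Derived RTO:", rto)
    print("System outage:", system_outage, "met RTO:", system_outage <= rto)
    print("Business recovery:", business_recovery)
```

The point of measuring from the outage timestamp rather than the declaration timestamp is that any delay in noticing or declaring the problem eats into the same budget as the recovery work.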

Since there is a risk of confusion, it would probably be good practice to make sure that a business continuity plan clearly defines the terms as used, for any given plan. In the commercial world I've seen marketing literature that plays fast & loose with this. I've seen an HA product that claims sub-second 'recovery' times from a fault but neglects to count the several seconds taken to detect the fault and begin that sub-second recovery process. Clearly, from the users' point of view, the total time is of more importance.

Steve

RTO is not the only misunderstood term

Rachel,

I worked on the ANSI/ASIS/BSI BCM.1-2010 standard, which is in the process of being published by ANSI. In our work we used BS25999 as the basis, with aspects drawn from industry representatives and major contributors in the US and over 100 other countries. The major hurdle we experienced was the terms used to define points of recovery.
Because of the many different interpretations, we focused on only two: RTO and RPO. We implied that what BS25999 calls the "Maximum tolerable period of disruption" (MTPD) is the "Maximum Time Down" (MTD), but did not define it as such. On the upside, both RTO and RPO were fully defined and agreed by the committee, with a loose MTD interpretation. Here are those interpretations:

Maximum Allowable Disruption and Recovery Time Objective:
The maximum allowable time (or maximum tolerable period of disruption) identifies the point at which the organization’s viability is threatened if the delivery of each product and service is not resumed. Top management can then set a recovery time objective for each product and service within this maximum time based on their assessment of the increasing impacts over time.
Once these times for delivery are established the organization should assign recovery time objectives to each organizational activity that contributes—directly or indirectly—to the delivery of the product or service based on:
• the role and timescale of each activity in service delivery
• management’s guidance regarding disruption tolerance for each activity;
• current and future-state strategic imperatives; and
• the interdependencies between activities and with external suppliers
• the currency of information required to undertake each activity is identified
Recovery time objectives are used to prioritize recovery efforts and the use of recovery resources. Recovery point objectives are used to determine an appropriate back-up strategy for information. These terms are applicable to all disciplines and are not exclusive to informational technology and data; and can be applied to other capabilities.

Definitions:
recovery time objective - period of time after which it is planned to recover each activities and resources to an acceptable capability after a disruptive event. This may be a simple resumption of full service or a phased return over a period.
recovery point objective - point in time to which data or capacity of a process is in a known and valid or integral state can be restored from. This should be less than the maximum amount of loss tolerance and may be defined in hours or days.

I hope this helps.

The full standard is available from ASIS (www.asisonline.org).
Here is the link to the press release: http://www.asisonline.org/newsroom/pressReleases/2010-12-14_BCMstandard.doc

Business Continuity Management

Hi guys,
I have just done a seminar on SS 540:2008, the Singapore standard for Business Continuity Management, and its relationship to the ISO 27001 standard. The seminar was organized by the IT Standards Committee (ITSC) of the Infocomm Development Authority of Singapore (IDA). I'm also a member of the ISO 27001 working group.

During the seminar, I also performed a high-level mapping between SS 540 and BS 25999. If you have a copy of the SS 540 standard, you will know that once you have covered all the requirements of SS 540, which is more comprehensive, you will have covered the BS 25999 requirements as well.

In the SS 540 standard, apart from common BCM terms like RTO and RPO, we also use a term called MBCO (Minimum Business Continuity Objective). MBCO is basically the minimum level of service that management accepts as sufficient to attain its business objectives when a disruption has occurred. This disruption can be caused by an incident, emergency, disaster, etc. An MBCO has to be defined at the organizational level and also at the Business Unit (BU) level, aligning back with the organization's business objectives.

As mentioned, and as Steve also agreed, the RTO is measured from the time a disaster occurs (not when it is declared) to the time the business is recovered to a level acceptable to the organization. Just to reiterate, the Recovery Time Objective (RTO) is defined in SS 540 as the period of time within which systems, applications or functions must be recovered after a disruption has occurred.

For the SS 540 standard, each BU needs to define its RTO and RPO, aligning back with the BU's MBCO. The values chosen for the BU's RTO and RPO depend on many factors, one of which is whether the function is a Critical Business Function (CBF).

In summary, the organization has to decide whether to declare a disaster and activate the Business Continuity (BC) plan within the RTO timeframe, i.e. it needs to factor the decision time for disaster declaration into the RTO. This includes the time needed for damage assessment by the Damage Assessment Team (DAT).

As RTO <= MTPD, failing to recover within the required timeline will cause the organization significant damage and loss, as defined by the individual organization.

Hope the inputs help and Happy New Year to everyone! ;>

Thanks!

Regards,
Chris Ng
Product Manager/Lead Auditor
TUV SUD PSB

Declaration of a disaster and where does it fall within a BCP

I would highly recommend that IT DR be integrated into the overall BCP. The greatest challenge that I have seen over the years is getting the BCP team to collaborate with the IT teams. They don't speak the same language, and they lack the capability to integrate the various components of a BCP (e.g. ER plans with security, with ITDR, with incident management, etc.). To be blunt, a lot of the BCP coordinators I have seen don't have enough experience and greatly lack IT and information management knowledge, which is essential. RTO and RPO should include the response, triage, escalation, contingency plans and recovery plans (while operating at Minimum Service Level), and all the information gathered from the BIA should help you in developing the above.

When you get an answer from a service provider on the RTO, make sure you account for the escalation, recovery strategies, contingency plans, etc. that must be incorporated to meet such a time frame. They don't care what you do in the background to make it happen; they are defining the business requirements, which ultimately give you the time you have to work with to ensure the service can be delivered.

I hope this helps.