Practical ITSM Advice: Defining Availability For An IT Service

As a follow-up to his presentation at the 2013 itSMF Norway conference, Stuart Rance of HP has kindly donated some practical advice for those struggling with availability.

Many IT organizations define availability for IT services using a percentage (e.g. 99.999% or “five 9s”) without any clear understanding of what the number means, or how it could be measured. This often leads to dissatisfaction, with IT reporting that they have met their goals even though the customer is not satisfied.

A simple calculation of availability is based on agreed service time (AST) and downtime (DT):

Availability = 100% * (AST – DT) / AST

If AST is 100 hours and downtime is 2 hours then availability would be:

Availability = 100% * (100 – 2) / 100 = 98%
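As a minimal sketch, the basic calculation, 100% * (AST – DT) / AST, can be written in Python. The function name and the input checks are illustrative choices, not part of any standard.

```python
def availability_percent(agreed_service_time: float, downtime: float) -> float:
    """Return availability as a percentage of agreed service time (AST)."""
    if agreed_service_time <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime <= agreed_service_time:
        raise ValueError("downtime must be between 0 and the agreed service time")
    return 100.0 * (agreed_service_time - downtime) / agreed_service_time


# AST of 100 hours with 2 hours of downtime
print(availability_percent(100, 2))  # 98.0
```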

Customers are interested in their ability to use IT services to support business processes. Availability reports will only be meaningful if they describe things the customer cares about, for example the ability to send and receive emails, or to withdraw cash from ATMs.

Number and duration of outages

A service that should be available for 100 hours and has 98% availability has 2 hours of downtime. This could be a single 2-hour incident, or many shorter incidents. The relative impact of a single long incident or many shorter incidents is different for different business processes. For example, a billing run that has to be restarted and takes 2 days to complete will be seriously impacted by each outage, but the outage duration may not be important. A web-based shopping site may not be impacted by a 2-minute outage, but after 2 hours the loss of customers could be significant. Table 1 shows some examples of how an SLA might be documented to show this varying impact.

Table 1 - Outage duration and maximum frequency

Outage duration            Maximum frequency
up to 2 minutes            2 events per hour
                           5 events per day
                           10 events per week
2 minutes to 30 minutes    2 events per week
                           6 events per quarter
30 minutes to 4 hours      4 events per year
4 hours to 8 hours         1 event per year

If we document availability like this, there are two big benefits.

  1. The numbers mean a lot more to the customer than simple percentage availability
  2. People designing infrastructure and applications have better guidance on what they need to achieve
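Limits like those in Table 1 can also be checked automatically against an outage log. The sketch below is one possible approach, not a prescribed method. It assumes rolling (not calendar) windows, a 91-day quarter, a 365-day year, and that each duration band includes its upper bound. The names TABLE_1 and find_breaches and the sample log are hypothetical.

```python
from datetime import datetime, timedelta

# Table 1 as data: (shortest, longest, [(max_events, window), ...]).
# An outage belongs to a band when shortest < duration <= longest.
# Assumption: a quarter is 91 days and a year is 365 days.
TABLE_1 = [
    (timedelta(0), timedelta(minutes=2),
     [(2, timedelta(hours=1)), (5, timedelta(days=1)), (10, timedelta(weeks=1))]),
    (timedelta(minutes=2), timedelta(minutes=30),
     [(2, timedelta(weeks=1)), (6, timedelta(days=91))]),
    (timedelta(minutes=30), timedelta(hours=4), [(4, timedelta(days=365))]),
    (timedelta(hours=4), timedelta(hours=8), [(1, timedelta(days=365))]),
]


def find_breaches(outages):
    """Return (longest, max_events, window) for every limit that was exceeded.

    `outages` is a list of (start, duration) pairs. Outages longer than
    8 hours are not covered by Table 1, so this sketch ignores them.
    """
    breaches = []
    for shortest, longest, rules in TABLE_1:
        starts = sorted(s for s, d in outages if shortest < d <= longest)
        for max_events, window in rules:
            # A limit is exceeded when max_events + 1 outages start
            # within one rolling window.
            for i in range(len(starts) - max_events):
                if starts[i + max_events] - starts[i] < window:
                    breaches.append((longest, max_events, window))
                    break
    return breaches


# Hypothetical log: three short outages within 20 minutes
t0 = datetime(2013, 3, 4, 9, 0)
log = [
    (t0, timedelta(minutes=1)),
    (t0 + timedelta(minutes=10), timedelta(minutes=1)),
    (t0 + timedelta(minutes=20), timedelta(seconds=90)),
]
for longest, max_events, window in find_breaches(log):
    print(f"More than {max_events} outages of up to {longest} within {window}")
```

With this log only the "2 events per hour" limit is reported, because three outages of up to 2 minutes start within the same hour.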

Number of users affected

Most incidents don’t cause complete loss of service for all users. Some users may be unaffected, whilst others have no service at all. The extreme case is where a single user has a faulty client PC which prevents them making use of the service. Although we could class all of these as 100% loss of service, this would leave IT with an impossible goal, and would not be a fair measurement of availability.

The opposite extreme would be to say the service is available if any user can use it. This could lead to customer dissatisfaction if the service is unavailable for many users, whilst being reported as available!

“User Outage Minutes” can be used to measure and report the impact on users. This is calculated by multiplying the number of affected users by the incident duration. This number is then compared to the potential number of user minutes of service to generate an availability figure.

In one recent engagement we calculated the availability of IP Telephony in a Call Centre in terms of Potential AgentPhoneMinutes and Lost AgentPhoneMinutes. For applications that deal with transactions or manufacturing a similar approach can be used to measure the number of lost transactions, or the extent of lost production, compared to the expected numbers.
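The "User Outage Minutes" calculation can be sketched as follows. The function name and the sample figures (1,000 users, 100 hours of agreed service time, two incidents) are made up for illustration.

```python
def user_minute_availability(incidents, total_users, service_minutes):
    """Availability based on lost user minutes versus potential user minutes.

    `incidents` is a list of (affected_users, duration_minutes) pairs.
    """
    potential = total_users * service_minutes
    lost = sum(users * minutes for users, minutes in incidents)
    return 100.0 * (potential - lost) / potential


# Hypothetical month: 1,000 users, 100 hours (6,000 minutes) of agreed service time
incidents = [
    (50, 120),    # 50 users without service for 2 hours
    (1000, 30),   # all users without service for 30 minutes
]
print(f"{user_minute_availability(incidents, 1000, 6000):.2f}%")  # 99.40%
```

The same structure works for lost transactions or lost production: replace user minutes with the expected and lost counts.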

Criticality of each business function

Most IT services support a number of business processes. Some of these are critical and others are less important. For example, an ATM may support cash dispensing and statement printing.

You can create a table that shows the impact on the service when each function is unavailable. Table 2 shows an example of how this might be documented.

Table 2 - Percentage degradation of service

IT function that is not available    % degradation of service
Sending email                        100%
Receiving email                      100%
Reading public folders               50%
Updating public folders              10%
Accessing shared calendars           30%
Updating shared calendars            10%

Note: Figures are not intended to add up to 100%

This table can help IT and the customer discuss and agree the relative importance of different functions, and can then be used to calculate service availability that reflects that importance.
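The article does not say exactly how the degradation figures feed into the availability calculation. One plausible approach, shown here purely as an assumption, is to weight each incident's duration by the percentage degradation of the function that was lost. The sketch also assumes incidents do not overlap, and the names and sample incidents are hypothetical.

```python
# Degradation figures from Table 2, expressed as fractions
DEGRADATION = {
    "Sending email": 1.0,
    "Receiving email": 1.0,
    "Reading public folders": 0.5,
    "Updating public folders": 0.1,
    "Accessing shared calendars": 0.3,
    "Updating shared calendars": 0.1,
}


def weighted_availability(incidents, agreed_service_hours):
    """Availability where each outage counts in proportion to its impact.

    `incidents` is a list of (function_name, duration_hours) pairs.
    Assumption: incidents do not overlap in time.
    """
    weighted_downtime = sum(DEGRADATION[name] * hours for name, hours in incidents)
    return 100.0 * (agreed_service_hours - weighted_downtime) / agreed_service_hours


incidents = [
    ("Sending email", 2),             # counts as 2.0 hours
    ("Updating public folders", 4),   # counts as 0.4 hours
]
print(f"{weighted_availability(incidents, 100):.2f}%")  # 97.60%
```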

Measurement period

It is essential to specify the time period over which measurement and reporting take place, as this can have a dramatic effect on the numbers.

Consider a 24*7 service with an agreed availability of 99% that has an 8-hour outage:

  • If we report availability every week then AST is 24 * 7 hours = 168 hours
  • Measured monthly the AST is (24 * 365) / 12  = 730 hours
  • Measured quarterly the AST is (24 * 365) / 4 = 2190 hours

Putting these numbers into the availability equation gives:

  • Weekly availability = 100% * (168 – 8) / 168 = 95.24%
  • Monthly availability = 100% * (730 – 8) / 730 = 98.90%
  • Quarterly availability = 100% * (2190 – 8) / 2190 = 99.63%

Each of these is a valid figure for the availability of the service, but only one of them shows that the target was met.
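The effect of the measurement period is easy to reproduce. This short sketch recomputes the three figures for the 8-hour outage and compares each with the 99% target. The variable names are illustrative.

```python
TARGET = 99.0          # agreed availability in percent
OUTAGE_HOURS = 8

# Agreed service time (hours) for a 24*7 service over each reporting period
AST_HOURS = {
    "weekly": 24 * 7,
    "monthly": 24 * 365 / 12,
    "quarterly": 24 * 365 / 4,
}

results = {}
for period, ast in AST_HOURS.items():
    availability = 100.0 * (ast - OUTAGE_HOURS) / ast
    results[period] = availability
    status = "met" if availability >= TARGET else "missed"
    print(f"{period:<10} {availability:6.2f}%  target {status}")
```

Only the quarterly figure meets the target, even though all three describe the same outage.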

Planned downtime

One aspect of availability measurement and reporting that is often overlooked is planned downtime. Some SLAs are written so that planned downtime that occurs in a specific window is not included in availability calculations. For other customers, planned downtime that has been scheduled at least 4 weeks in advance is not counted.

Whatever choice is made, the SLA must clearly define how planned downtime is to be included in calculations of availability.
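As an illustration of how much the chosen rule matters, the sketch below applies the "at least 4 weeks' notice" rule to a hypothetical month. It assumes excluded planned downtime simply does not count as downtime and that the agreed service time is left unchanged. A real SLA might define this differently. The Downtime class and the sample events are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Downtime:
    hours: float
    planned: bool
    notice_days: int = 0   # how far in advance planned work was scheduled


def counted_downtime(events, min_notice_days=28):
    """Sum the downtime that counts against the availability target.

    Assumption: planned downtime with at least `min_notice_days` notice is excluded.
    """
    return sum(
        e.hours for e in events
        if not (e.planned and e.notice_days >= min_notice_days)
    )


events = [
    Downtime(hours=8, planned=False),
    Downtime(hours=4, planned=True, notice_days=35),  # excluded
    Downtime(hours=2, planned=True, notice_days=7),   # too little notice, counts
]
ast = 730  # monthly agreed service time in hours
dt = counted_downtime(events)
print(f"Counted downtime: {dt} hours, availability {100.0 * (ast - dt) / ast:.2f}%")
```

Counting all 14 hours instead of 10 would lower the reported figure, which is why the rule needs to be written into the SLA.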

Measuring availability

It is essential that the IT organization measures and reports availability in terms that can be compared to the targets. Some common approaches to measuring availability are:

  • Collect data at the service desk that identifies the business impact and duration of each incident. This is often fairly inexpensive to do, but may lead to disputes about the accuracy of the data.
  • Instrument all components required to deliver the service and calculate the availability based on understanding how each component contributes to the end-to-end service. This can be very effective, but may miss subtle failures, for example a minor database corruption could result in some users being unable to submit particular types of transaction.
  • Use dummy clients submitting known transactions from particular points on the network to identify when the service is functioning. This does actually measure end-to-end availability (and performance), but could under-report downtime caused by subtle failures.
  • Instrument applications to report end-to-end availability. This can actually measure end-to-end service availability, but the requirement must be included early in the application design.

In practice, a solution should use a combination of the above methods. This should be agreed with the customer and documented in the SLA.

My thanks to Stuart for kindly sharing his knowledge and experience.

Comments

A very good read! Together with real business services (and not IT services) it will help a lot of companies.

Robert

Managing Expectations

Stephen,

My thanks to Stuart for this contribution. I enjoyed this blog post a lot and there are many excellent points here for consideration. I always appreciate it when there is more than a surface discussion on availability and I think this one gives most anyone something to think about.

The piece that I'd like to add in is one that I often add whenever a discussion of availability management comes up -- we must actively manage our customers' expectations, not just pay attention or manage to the terms of the "formal agreement." It's possible to completely satisfy all of the measures in the formal agreement and still have a customer that is livid!

In the case of this example, there are clearly times when 30 minutes of downtime is unnoticeable, others when it's annoying and still others when it's a major catastrophe. Of course, this varies by service as well. The bottom line is that it's all about impact to the affected user/community and the vital mission activity that they're pursuing.

Impact goes well beyond the service desk and/or setting flags on an incident ticket. We need to understand what impact actually means to both the customer (what their desired experience is) and to the service provider (what our response posture should be, what actions we will take and resources we'll commit).

If we don't ensure that it's part of an integrated conversation towards managing customer expectations, we might find that we take actions that are actually "above and beyond the call of duty," spend a lot of time and money -- and still fail in the eyes of our customer.

kengon
www.kennethgonzalez.com

Service expectation management

I have heard James Finister talk of service expectation management over service level management (hopefully I have not misquoted him here) ... it makes so much sense. IMO so many of the issues we have that aren't people/management/leadership-related stem from not understanding/setting/managing customer expectations.

"Caution: Lanes Merge Ahead"

Stephen,

I think this is a topical area which perfectly demonstrates why there is *no real split* between Outside-In (O-I) and Inside-Out (I-O) thinking. We really do need both!

What we see today is that we have a genetic bias towards I-O (from an industry-wide perspective) and we need to strengthen our Knowledge, Skills and Abilities around O-I. We need to drive that "maturity level" higher, before we really worry about (traditional) maturity on the I-O side of the continuum.

The good news is that taking tangible ground on the O-I side is relatively quick, easy and inexpensive, compared with taking the same amount of ground on the I-O side. The secondary benefit is that it helps set the proper context for improvements which positively impact maturity levels on the I-O side.

I think that the only reason why it's such a "radical thought" to work on forwarding both of these at the expense of neither is that the idea is being evaluated for what it's perceived to be, rather than on its merits.

kengon
www.kennethgonzalez.com

Why do we define availability?

Ken,

Thank you for this response. I think there are two distinct reasons for defining availability targets.

Firstly it gives the people designing the service the information they need to get the design right. If a two minute outage is not acceptable then they can't use clustering or RAID for example. If you must recover in 2 hours except for one event a year then you can't rely on restoring backups, etc.

Secondly it helps to set expectations, but this doesn't mean that you can deliver a report and say "There you are, customer, now you are satisfied". You can however say "We have met all the agreed targets, did the service meet your needs?" and if the customer says "no" they can clarify their concerns to help you deliver what they really needed next time.

Targets Are Smart!

Stuart,

I agree with you on both points... and I am not shocked about that at all! ;-) That said, I'll offer some additional comments to keep the discussion going.

Point 1:
I think that defining and using availability targets is a very *smart and useful* thing to do. In the USMBOK, we describe this relationship using two terms:
* Service Level Objectives (SLO)
* Service Level Indicators (SLI)

Their use can include, but is not limited to, availability metrics. Typically, these will roll up into metrics that will be part of what is written into a formal (or even informal) agreement. At a minimum, they can be used as a design aid and an early warning system, as well as much more!

Point 2:
Anyone that would use service metrics and reporting in this way deserves the slap in the face that would surely follow. Still, I see examples where reports are generated (even reviewed) where the customer has little/no interest in what the reports say.

They have what they (being the customer) care about and then there's this report that they must "check the box on." They get it, but it doesn't speak to them. We have a lot of work to do, as an industry, to help people adjust their thinking here.

You're right on the money in suggesting that both service providers and customers would be better served by *engaging* and ascertaining whether or not the customer is getting what they contracted the provider for. I think there's value for both parties, as it can serve as the foundation for future improvements.

kengon
www.kennethgonzalez.com

Slides are available for this presentation

Thank you for publishing this.

The presentation on which this content was based is available on SlideShare for anyone who would like to read the content in that format:

http://www.slideshare.net/stuartrance7/stuart-rance-defining-availabilit...