My colleague and friend Mike Gualtieri wrote a really interesting blog the other day titled "Agile Software Is A Cop-Out; Here's What's Next." While I am not going to discuss the great conclusions and "next practices" of software (SW) development Mike suggests in that blog, I do want to focus on the assumption he makes about using working SW as a measurement of Agile.
I am currently researching that area and investigating how organizations actually measure the value of Agile SW development (business and IT value). And I am finding that, while organizations aim to deliver working SW, they also define value metrics to measure progress and much more:
Cycle time (e.g., from concept to production);
Business value (from number of times a feature is used by clients to impact on sales revenue, etc.);
Productivity metrics (such as burndown velocity, number of features deployed versus estimated); and last but not least
Quality metrics (such as defects per sprint/release, etc.).
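To make a few of these metrics concrete, here is a minimal Python sketch of how they might be computed. The `Sprint` record, its field names, and the helper functions are hypothetical and chosen only for illustration; they are not taken from any particular organization's tooling, and real teams will define these measures differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Sprint:
    """Hypothetical per-sprint record; field names are assumptions."""
    features_estimated: int
    features_deployed: int
    defects: int


def cycle_time_days(concept: date, production: date) -> int:
    """Cycle time: calendar days from concept to production."""
    return (production - concept).days


def delivery_ratio(sprint: Sprint) -> float:
    """Productivity: features deployed versus features estimated."""
    if sprint.features_estimated == 0:
        raise ValueError("features_estimated must be greater than zero")
    return sprint.features_deployed / sprint.features_estimated


def defects_per_sprint(sprints: list[Sprint]) -> float:
    """Quality: average number of defects across the given sprints."""
    if not sprints:
        raise ValueError("at least one sprint is required")
    return sum(s.defects for s in sprints) / len(sprints)


if __name__ == "__main__":
    # Illustrative numbers only, not real project data.
    history = [Sprint(10, 8, 4), Sprint(12, 12, 2)]
    print(cycle_time_days(date(2011, 1, 10), date(2011, 3, 1)))
    print(delivery_ratio(history[0]))
    print(defects_per_sprint(history))
```

Business value is deliberately left out of the sketch: feature usage counts and revenue impact depend on data sources that vary too much from one organization to the next to generalize here.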
We live in a time when customers expect services to be delivered non-stop, without interruption, 24x7x365. Need proof? Just look at the outrage this week stemming from RIM's 3+ day BlackBerry service outage. Yes, this was an unusually long and widespread disruption, but it seems like every week there is a new example of a service disruption whipping social networks and blogs into a frenzy, whether it's Bank of America, Target, or Amazon. I'm not criticizing those who use social media outlets to voice their dissatisfaction over service levels (I've even taken part in it, complaining on Twitter about Netflix streaming being down on a Friday night when I wanted to stream a movie). My point is that now more than ever infrastructure and operations professionals need to rethink how they deliver services to both their internal and external customers.
One thing that I’ve found in common across infrastructure and operations groups of all shapes and sizes is that they are continually searching for the ideal set of key performance indicators: a set of metrics that perfectly measures their infrastructure and demonstrates the excellence of their operations, yet is still simple and cheap to collect. At least once a week I speak with a client searching for the holy grail of metrics, hopeful that I hold that coveted knowledge. They’re inevitably disappointed to find out that I don’t know what the best set of metrics is, and that I truly think that it doesn’t exist! Sorry if I’m bursting your bubble, but there is no essential set of metrics for all infrastructure and operations organizations. What makes sense for one organization to measure may be completely useless for another. What is very simple to collect at one company may be nearly impossible at another.
While I don’t believe in the myth of a single set of perfect metrics for all organizations, I do think it is valuable to learn what other organizations are measuring so that you can compare their metrics to your own (and maybe steal some of theirs). That is why I am gathering a list of metrics from infrastructure and operations groups globally in order to form a database of metrics. Once we have a good number of metrics on this list, I will work to consolidate them down to the most commonly cited metrics and collect a benchmark on them. We’re calling this project “Forrester's Consensus Metrics For Infrastructure & Operations,” and I really hope you’ll consider contributing to it because we can’t do this without your input.
“Are you on the business side or the IT side?” was a question I received maybe a half dozen times last month while I was attending the Disaster Recovery Journal Fall World in San Diego. This question really got me thinking. Everyone at the conference worked in business continuity (BC) and/or disaster recovery (DR), but there was a definite divide between those who reported into IT departments and those who reported into the business. For the most part, those who reported into IT had a DR focus, and those who reported into the business (or perhaps into security and risk) had a BC focus. Attending the different breakout sessions across both domains, I noted the good news: both groups speak the same language (RTO, RPO, availability, downtime, resilience, etc.). The bad news is that I’m not sure we’re all using the same dictionary.
Two of the business-focused sessions I attended pointed out a troubling difference in the way IT and the business interpret one of the simplest of BC/DR terms: RTO. What is RTO? Simply put, it is the time to recover a service after an outage. This seems straightforward enough, but let’s break out how a business and an IT professional might understand RTO:
Business: The maximum amount of time that my service can be unavailable.
IT: The amount of time it takes to recover that service.
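The difference between these two readings can be made concrete with a small Python sketch. The function names and the four-hour and six-hour figures below are hypothetical, chosen only to illustrate the mismatch; they do not come from the sessions described above.

```python
from datetime import timedelta


def meets_business_rto(max_tolerable_downtime: timedelta,
                       it_recovery_time: timedelta) -> bool:
    """True if IT can recover within the downtime the business will accept."""
    return it_recovery_time <= max_tolerable_downtime


def rto_gap(max_tolerable_downtime: timedelta,
            it_recovery_time: timedelta) -> timedelta:
    """How far IT's recovery time overshoots the business limit.

    A positive result means recovery takes longer than the business
    can tolerate; zero or negative means the two readings are compatible.
    """
    return it_recovery_time - max_tolerable_downtime


if __name__ == "__main__":
    # Illustrative values only: both sides might call their number "the RTO".
    business_rto = timedelta(hours=4)  # business: maximum time the service may be down
    it_rto = timedelta(hours=6)        # IT: time it takes to recover the service
    print(meets_business_rto(business_rto, it_rto))
    print(rto_gap(business_rto, it_rto))
```

When both sides report their number under the same label, a gap like this can go unnoticed until an actual outage.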