The Gulf oil spill of April 2010 was an unprecedented disaster. The National Oil Spill Commission’s report summary shows that it could have been prevented with better technology. For example, while the Commission agrees that the monitoring systems used on the platform provided the right data, it points out that those systems relied on engineers to make sense of that data and correlate the right elements to detect anomalies. “More sophisticated, automated alarms and algorithms” could have been used to create meaningful alerts and perhaps prevent the explosion. The Commission’s report shows that the reporting systems in use have not kept pace with the increased complexity of drilling platforms. Another conclusion is even more disturbing: these deficiencies are not uncommon, and other drilling platforms in the Gulf of Mexico face similar challenges.
If we replace “drilling platform” with “data center,” this sounds awfully familiar. How many IT organizations rely on relatively simple data collection from point monitoring tools (network, server, or application) while trying to manage the performance and availability of increasingly complex applications? IT operations engineers sift through mountains of data from different sources, trying to make sense of what is happening, and usually fall short of finding meaningful alerts. The consequences may not be as dire as the Gulf oil spill, but they can still translate into lost productivity and revenue.
The fact that many IT operations have not (yet) faced a meltdown is not a valid counterargument: there is, after all, a good reason to purchase hurricane insurance when one lives in Florida, even though destructive storms are not that common. As with the weather, there are so many variables at play in today’s business services that mere humans can’t be expected to make sense of them all.