Why It's Important To Start Every Big Data Project With A Question

During a recent webinar on big data, several listeners wanted to know what the biggest stumbling blocks and reasons for failure were when it comes to big data projects, and what they could do to avoid them. Given how strongly my answer resonated, in particular the top issue I cited, I thought I’d share it in this blog post. Please let me have your views and comments.

There are clearly many reasons why projects struggle or fail, and big data projects are no exception. What can put big data initiatives in a league of their own, though, is the level of (typically unrealistic) expectations often associated with “big data” technologies. Based on many conversations with clients, consultants, and conference delegates over the past couple of years, I find three key issues are being mentioned time and again. These are:

  • Not starting the project with a question
  • Underestimating the technical skills and expertise required
  • Creating another data silo

Most people can readily relate to the second and third items, and I won’t go any further into these in this blog post. The first — not starting the project with a question — at first glance seems to go against the grain of what “big data” is meant to be about, in particular when seen in the context of the promises made by technology vendors and in media reports about amazing project successes: “finding the needle in the haystack”; “gaining insights you’ve never had before”; “discovering the unexpected”; “seeing new patterns”; and so on.

These are not unrealistic expectations; you can find the needle in the haystack, gain new insights, etc. However, such successes don’t start with a data scientist or other expert simply “exploring” the data or running random algorithms. Sure, such techniques will yield results; but without context, there’s no way of telling whether these are noise or signal. Consequently, there’s at best no business benefit; at worst, decisions may be taken that are actually detrimental to the business. In addition, a lot of time typically gets wasted on aimlessly “wandering about” amid the data, trying to see what there is to be seen. In either case, projects are quietly abandoned or funding doesn’t get renewed.

But doesn’t “starting with a question” raise the risk of missing potentially important insights? The answer is: not really, if approached in the right way. Starting with a question doesn’t mean that the question remains unchanged. Think about it like an Internet search: you start with a question, and get a bunch of results; upon seeing the results, you realize immediately whether your search terms were optimal or not; you change your search terms, you get different results, and repeat as needed; but you’ve always got your original goal in mind. And you’ll also know when to give up. While this analogy somewhat simplifies the issue, it does contain the essence of what’s important when embarking upon a big data project: a goal or a hypothesis. Otherwise, why are you asking the question in the first place?

A goal may be expressed in terms of a key business objective: better customer service, improved targeting of marketing materials, streamlining of processes, or reducing cost, to mention just a few. The catalyst for a big data project may also be a business pain point or problem: customer churn, risk of losing a contract because of service issues, excessive returns, frequent network outages, and so on. In all cases, understanding the goal yields a question, or set of questions. Just like with an Internet search, these may be modified once the first results are in. But the focus remains, and hence the likelihood of success. And because there’s a clear link to what’s important to the business, it’s also less likely that funding will get turned off.

What if you just want to get a better understanding of big data technologies without involving the business, and without running the risk of project failure due to skills gaps? There’s nothing wrong with cutting your teeth on big data technologies by applying them within the IT organization. Log file analysis, for example, can be a way of gaining familiarity with new tools and techniques. But again, it works better if there is a clear objective expressed in terms of a question, or questions: What can we do to improve network performance, for example, or to reduce server downtime?

Clearly, other steps will need to be taken to make constructive use of new insights, whether it’s purely within the IT department, or involving the business. But when it comes to business-facing projects, it is essential that the starting point is a question, or set of questions, clearly linked to a business goal. Otherwise, the risk is great that “IT will get another black eye,” as my colleague Brian Hopkins put it in his excellent blog post “Don’t Have A Big Data Strategy Yet? Good.”

Comments

Martha, good article. With the explosion of big data, companies are faced with data challenges in three different areas. First, you know the type of results you want from your data, but they are computationally difficult to obtain. Second, you know the questions to ask but struggle with the answers and need to do data mining to help find those answers. And third is the area of data exploration, where you need to reveal the unknowns and look through the data for patterns and hidden relationships. The open source HPCC Systems big data processing platform can help companies with these challenges by making it quick and simple to derive insights from massive data sets. Designed by data scientists, it is a complete integrated solution from data ingestion and data processing to data delivery. Its built-in Machine Learning Library and Matrix processing algorithms can assist with business intelligence and predictive analytics. More at http://hpccsystems.com