The business has an insatiable appetite for data and insights. Even in the age of big data, the number one issue for business stakeholders and analysts is getting access to the data. Once access is achieved, the next step is "wrangling" the data into a usable data set for analysis. The term "wrangling" itself creates a nervous twitch, unless you enjoy the rodeo. But the goal of the business isn't to be an adrenaline junkie; it is to gain insights that help it smartly navigate increasingly complex business landscapes and customer interactions. Those who get this have introduced a softer term, "blending": yet another term dreamed up by data vendor marketers to avoid the dreaded conversation about data integration and data governance.
The reality is that you can't market-message your way out of the fundamental problem: big data is creating data swamps, even in the best-intentioned efforts. (This is the consequence of big data's first principle of schema-less data.) Data governance for big data is largely relegated to cataloging data and its lineage, which serves the data management team but creates a new kind of nightmare for analysts and data scientists: working with a card catalog that rivals the Library of Congress. Dropping in a self-service business intelligence tool or advanced analytics solution doesn't solve the problem of familiarizing the analyst with the data. Analysts will still spend up to 80% of their time just trying to create the data set needed to draw insights.
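To make that 80% figure concrete, here is a minimal pure-Python sketch of what "blending" actually involves: reconciling two hypothetical source extracts whose keys and field names don't line up. Every record, field, and value below is invented for illustration.

```python
# Hypothetical extracts from two source systems; all names here are
# invented for illustration. Note the mismatches an analyst must fix:
# inconsistent key casing and different field names for the same entity.
crm = [
    {"customer_id": "C001", "region": "EMEA"},
    {"customer_id": "C002", "region": "APAC"},
    {"customer_id": "C003", "region": "EMEA"},
]
web = [
    {"cust": "c001", "visits": 12},  # lowercase IDs, different field name
    {"cust": "c003", "visits": 7},
    {"cust": "c004", "visits": 3},   # unknown customer: dropped by the join
]

# Normalize the join key, then left-join CRM records against web activity,
# defaulting customers with no recorded visits to zero.
visits_by_id = {row["cust"].upper(): row["visits"] for row in web}
blended = [
    {**row, "visits": visits_by_id.get(row["customer_id"], 0)}
    for row in crm
]

for row in blended:
    print(row)
```

Even this trivial blend required a key-normalization rule, a join strategy, and a missing-value policy; multiply that by dozens of ungoverned sources in a data lake and the 80% figure stops looking surprising.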
Last year I published a reasonably well-received research document on Hadoop infrastructure, “Building the Foundations for Customer Insight: Hadoop Infrastructure Architecture”. Now, less than a year later, it’s looking obsolete, not so much because it was wrong for traditional Hadoop (and yes, it does seem funny to use a word like “traditional” to describe a technology that is itself still rapidly evolving and has been in mainstream use for only a handful of years), but because the universe of analytics technology and tools has been evolving at light speed.
If your analytics are anchored by Hadoop and its underlying MapReduce processing, then the mainstream architecture described in the document, that of clusters of servers each with their own compute and storage, may still be appropriate. On the other hand, if, like many enterprises, you are adding analysis tools such as NoSQL databases, SQL-on-Hadoop engines (Impala, Stinger, Vertica), and particularly Spark, an in-memory analytics technology well suited to real-time and streaming data, you may need to reassess the supporting infrastructure to build something that can continue to support Hadoop while also catering to the differing access patterns of these other tool sets. This need to rethink the underlying analytics plumbing was brought home by a recent demonstration by HP of a reference architecture for analytics, publicly referred to as the HP Big Data Reference Architecture.
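To see why Spark's access pattern stresses infrastructure differently than MapReduce, consider a toy sketch (plain Python, not actual Spark or Hadoop code): a MapReduce-style pipeline round-trips every intermediate result through storage, while an in-memory pipeline keeps the working set resident between steps. The two produce identical answers; what differs is where the I/O pressure lands.

```python
import os
import tempfile

# Toy illustration only: this is plain Python, not Spark or MapReduce.
records = [str(i) for i in range(1000)]

def disk_stage(in_path, out_path, fn):
    """Each stage reads its input from disk and writes its output back,
    the way intermediate results flow between map and reduce phases."""
    with open(in_path) as f:
        data = [fn(line.strip()) for line in f]
    with open(out_path, "w") as f:
        f.write("\n".join(data))

with tempfile.TemporaryDirectory() as d:
    p0, p1, p2 = (os.path.join(d, n) for n in ("in", "s1", "s2"))
    with open(p0, "w") as f:
        f.write("\n".join(records))
    disk_stage(p0, p1, lambda x: str(int(x) * 2))   # stage 1: double
    disk_stage(p1, p2, lambda x: str(int(x) + 1))   # stage 2: increment
    with open(p2) as f:
        batch_result = [int(line) for line in f]

# In-memory style: the intermediate result never touches storage.
memory_result = [int(r) * 2 + 1 for r in records]

print(batch_result[:5])  # → [1, 3, 5, 7, 9]
```

A cluster sized for the first pattern optimizes local disk throughput per node; one sized for the second needs far more RAM per node, which is exactly the kind of divergence a shared reference architecture has to accommodate.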
At the China Hadoop Summit 2015 in Beijing this past weekend, I talked with various big data players, including large consumers of big data (China Unicom, Baidu.com, JD.com, and Ctrip.com); Hadoop platform solution providers (Hortonworks, RedHadoop, BeagleData, and Transwarp); infrastructure software vendors like Sequotia.com; and agile BI software vendors like Yonghong Tech.
The summit was well-attended — organizers planned for 1,000 attendees and double that number attended — and from the presentations and conversations it’s clear that big data ecosystems are making substantial progress. Here are some of my key takeaways:
Telcos are focusing on optimizing internal operations with big data. Take China Unicom, one of China’s three major telcos, for example. China Unicom has completed a comprehensive business scenario analysis of related data across each segment of its internal business operations, including business and operations support systems, Internet data centers, and networks (fixed, mobile, and broadband). It has built a Hadoop-based big data platform that processes trillions of mobile access records every day within the mobile network to provide practical guidelines and progress monitoring for the construction of base stations.
If you think you can do big data in-house, get ready for a lot of disappointment. If the data you want to analyze is terabytes in size, comes from multiple sources -- streaming in from customers, devices, or sensors -- and the insights you need are more complex than basic trending, you are probably looking for a data scientist or two. You probably have an open job requisition for a Hadoop expert as well and have hit the limit on what your capital budget will let you buy to house all of this data and insight. Thus you are likely taking a hard look at cloud-based options to fill your short-term needs.
By now you have at least seen the cute little elephant logo or you may have spent serious time with the basic components of Hadoop like HDFS, MapReduce, Hive, Pig and most recently YARN. But do you have a handle on Kafka, Rhino, Sentry, Impala, Oozie, Spark, Storm, Tez… Giraph? Do you need a Zookeeper? Apache has one of those too! For example, the latest version of Hortonworks Data Platform has over 20 Apache packages and reflects the chaos of the open source ecosystem. Cloudera, MapR, Pivotal, Microsoft and IBM all have their own products and open source additions while supporting various combinations of the Apache projects.
After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014. For those whose day jobs don’t include constantly tracking Hadoop’s evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework. We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop Distributed File System (HDFS) and an extended group of components that leverage Hadoop but do not require it.
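For readers who haven't looked under the hood, the MapReduce model at the heart of that core set is simple to sketch. Here's an illustrative pure-Python word count showing the map, shuffle, and reduce phases; real Hadoop distributes each phase across a cluster and moves the intermediate pairs over the network.

```python
from collections import defaultdict

# Minimal pure-Python sketch of the MapReduce flow that HDFS-centric
# tools build on: map emits key/value pairs, shuffle groups them by key,
# and reduce aggregates each group. Illustrative only -- real Hadoop
# runs these phases in parallel across many machines.
lines = ["big data big insights", "data lakes and data swamps"]

# Map: each input line emits (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group.
counts = {word: sum(vals) for word, vals in groups.items()}

print(counts["data"])  # → 3
```

Higher-level core tools such as Hive and Pig exist precisely so analysts can express queries without hand-writing pipelines shaped like this one.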
In the past, enterprise architects could afford to think big picture, and that meant treating Hadoop as a single package of tools. Not anymore – you need to understand the details to keep up in the age of the customer. Use our framework to help, but please read the report if you can, as I include a lot more detail there.
At its Paris summit, the OpenStack community celebrated the 10th release of what has become the leading open source Infrastructure as a Service cloud platform software. What stood out about this latest iteration and the progress of its ever-growing ecosystem of vendors, users and service providers was the lack of excitement that comes with maturity. The Juno release addressed many challenges holding back enterprise adoption to this point and showed signs that 2015 may prove to be the year its use shifts over from mostly test & dev, to mostly production. Forrester clients will find a new Quick Take on OpenStack that analyzes the state of this platform and recommended actions here. In this blog post we look at looming questions facing the OpenStack community that could affect the pace and direction of its innovation.
Hadoop adoption and innovation are moving forward at a fast pace, playing a critical role in today's data economy. But how fast and how far will Hadoop go heading into 2015?
Prediction 1: Hadooponomics makes enterprise adoption mandatory. The jury is in: Hadoop has been found not guilty of being an over-hyped open source platform. Hadoop has proven real enterprise value in any number of use cases, including data lakes, traditional and advanced analytics, ETL-less ETL, active-archive, and even some transactional applications. All of these use cases are powered by what Forrester calls “Hadooponomics” — its ability to linearly scale both data storage and data processing.
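The "linearly scale" claim can be made concrete with back-of-the-envelope math. The per-node figures below are assumptions for illustration, not vendor benchmarks: in a scale-out cluster, both storage and processing capacity grow in direct proportion to node count.

```python
# Back-of-the-envelope sketch of linear scale-out growth. The per-node
# figures are assumed for illustration, not actual benchmarks or pricing.
NODE_STORAGE_TB = 24      # usable storage per commodity node (assumed)
NODE_THROUGHPUT_GBPS = 1  # processing throughput per node (assumed)

def cluster_capacity(nodes):
    """Storage and throughput both grow proportionally with node count."""
    return nodes * NODE_STORAGE_TB, nodes * NODE_THROUGHPUT_GBPS

for n in (10, 20, 40):
    storage, throughput = cluster_capacity(n)
    print(f"{n} nodes: {storage} TB storage, {throughput} GB/s throughput")
```

The contrast is with scale-up systems, where each increment of capacity on a single larger machine costs progressively more; proportional growth on commodity hardware is the economic argument behind "Hadooponomics."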
What it means: The remaining minority of dazed and confused CIOs will make Hadoop a priority for 2015.
The data economy — or the system that provides for the exchange of digitized information for the purpose of creating insights and value — grew in 2014, but in 2015 we’ll see it leap forward significantly. It will grow from a phenomenon that mainstream enterprises view at arm’s length as interesting to one that they embrace as a part of business as usual. The number of business and technology leaders telling us that external data is important to their business strategy has been growing rapidly -- from one-third in 2012 to almost half in 2014.
Why? It’s a supply-driven phenomenon made possible by widespread digitization, mobile technology, the Internet of Things (IoT), and Hadooponomics. With countless new data sources and powerful new tools to wrest insights from their depths, organizations will scramble to use them to know their customers better and to optimize their operations beyond anything they could have done before. And while the exploding data supply will spur demand, it will also spur additional supply. Firms will be taking a hard look at their “data exhaust” and wondering if there is a market for new products and services based on their unique set of data. But in many cases, the value in the data is not that people will be willing to pay money for bulk downloads or access to raw data, but in data products that complement a firm’s existing offerings.
But Avoid Ending Up With A Zoo Of Individual Big Data Solutions
We are beyond the point of struggling over the definition of big data. That doesn’t mean that we've resolved all of the confusion that surrounds the term, but companies today are instead struggling with the question of how to actually get started with big data.
28% of all companies are planning a big data project in 2014.
According to Forrester's Business Technographics™ Global Data And Analytics Survey, 2014, 28% of the more than 1,600 responding companies globally are planning a big data project this year. More details, including how this splits between IT-driven and business-driven projects, can be found in our new Forrester report, ‘Reset On Big Data’.
Or join our Forrester Forum For Technology Leaders in London, June 12 and 13, 2014, to hear and discuss with us directly what big data projects your peers are planning, what challenges they are facing, and what goals they aim to achieve.