Open source big data technologies like Hadoop have done much to begin the transformation of analytics. We're moving from expensive and specialist analytics teams towards an environment in which processes, workflows, and decision-making throughout an organisation can - in theory at least - become usefully data-driven. Established providers of analytics, BI and data warehouse technologies liberally sprinkle Hadoop, Spark and other cool project names throughout their products, delivering real advantages and real cost-savings, as well as grabbing some of the Hadoop glow for themselves. Startups, often closely associated with shepherding one of the newer open source projects, also compete for mindshare and custom.
And the opportunity is big. Hortonworks, for example, has described the global big data market as a $50 billion opportunity. But that pales into insignificance next to what Hortonworks (again) describes as a $1.7 trillion opportunity. Other companies and analysts have their own numbers, which do differ, but the step-change is clear and significant. Hadoop, and the vendors gravitating to that community, mostly address 'data at rest'; data that has already been collected from some process or interaction or query. The bigger opportunity relates to 'data in motion,' and to the internet of things that will be responsible for generating so much of this.
One of the reasons for only a portion of enterprise and external (about a third of structured and a quarter of unstructured -) data being available for insights is a restrictive architecture of SQL databases. In SQL databases data and metadata (data models, aka schemas) are tightly bound and inseparable (aka early binding, schema on write). Changing the model often requires at best just rebuilding an index or an aggregate, at worst - reloading entire columns and tables. Therefore many analysts start their work from data sets based on these tightly bound models, where DBAs and data architects have already built business requirements (that may be outdated or incomplete) into the models. Thus the data delivered to the end-users already contains inherent biases, which are opaque to the user and can strongly influence their analysis. As part of the natural evolution of Business Intelligence (BI) platforms data exploration now addresses this challenge. How? BI pros can now take advantage of ALL raw data available in their enterprises by:
With the incredible popularity of big data and Hadoop every Business Intelligence (BI) vendor wants to also be known as a "BI on Hadoop" vendor. But what they really can do is limited to a) querying HDFS data organized in HIVE tables using HiveQL or b) ingest any flat file into memory and analyze the data there. Basically, to most of the BI vendors Hadoop is just another data source. Let's now see what qualifies a BI vendor as a "Native Hadoop BI Platform". If we assume that all BI platforms have to have data extraction/integration, persistence, analytics and visualization layers, then "Native Hadoop/Spark BI Platforms" should be able to (ok, yes, I just had to add Spark)
Use Hadoop/Spark as the primary processing platform for MOST of the aforementioned functionality. The only exception is visualization layer which is not what Hadoop/Spark do.
Use distributed processing frameworks natively, such as
Generation of MapReduce and/or Spark jobs
Management of distributed processing framework jobs by YARN, etc
Note, generating Hive or SparkSQL queries does not qualify
Do declarative work in the product’s main user interface interpreted and executed on Hadoop/Spark directly. Not via a "pass through" mode.
Natively support Apache Sentry and Apache Ranger security
Not very long ago, it would have been almost inconceivable to consider a new large-scale data analysis project in which the open source Apache Hadoop did not play a pivotal role.
Every Hadoop blog post needs a picture of an elephant. (Source: Paul Miller)
Then, as so often happens, the gushing enthusiasm became more nuanced. Hadoop, some began (wrongly) to mutter, was "just about MapReduce." Hadoop, others (not always correctly) suggested, was "slow."
Then newer tools came along. Hadoop, a growing cacophony (innacurately) trumpeted, was "not as good as Spark."
But, in the real world, Hadoop continues to be great at what it's good at. It's just not good at everything people tried throwing in its direction. We really shouldn't be surprised by this. And yet, it seems, so many of us are.
For CIOs asked to drive new programmes of work in which big data plays a part (and few are not), the competing claims in this space are both unhelpful and confusing. Hadoop and Spark are not, despite some suggestions, directly equivalent. In many cases, asking "Hadoop or Spark" is simply the wrong question.
In the past three decades, management information systems, data integration, data warehouses (DWs), BI, and other relevant technologies and processes only scratched the surface of turning data into useful information and actionable insights:
Organizations leverage less than half of their structured data for insights. The latest Forrester data and analytics survey finds that organizations use on average only 40% of their structured data for strategic decision-making.
Unstructured data remains largely untapped. Organizations are even less mature in their use of unstructured data. They tap only about a third of their unstructured data sources (28% of semistructured and 31% of unstructured) for strategic decision-making. And these percentages don’t include more recent components of a 360-degree view of the customer, such as voice of the customer (VoC), social media, and the Internet of Things.
BI architectures continue to become more complex. The intricacies of earlier-generation and many current business intelligence (BI) architectural stacks, which usually require the integration of dozens of components from different vendors, are just one reason it takes so long and costs so much to deliver a single version of the truth with a seamlessly integrated, centralized enterprise BI environment.
Existing BI architectures are not flexible enough. Most organizations take too long to get to the ultimate goal of a centralized BI environment, and by the time they think they are done, there are new data sources, new regulations, and new customer needs, which all require more changes to the BI environment.
The explosion of data and fast-changing customer needs have led many companies to a realization: They must constantly improve their capabilities, competencies, and culture in order to turn data into business value. But how do Business Intelligence (BI) professionals know whether they must modernize their platforms or whether their main challenges are mostly about culture, people, and processes?
"Our BI environment is only used for reporting — we need big data for analytics."
"Our data warehouse takes very long to build and update — we were told we can replace it with Hadoop."
These are just some of the conversations that Forrester clients initiate, believing they require a big data solution. But after a few probing questions, companies realize that they may need to upgrade their outdated BI platform, switch to a different database architecture, add extra nodes to their data warehouse (DW) servers, improve their data quality and data governance processes, or other commonsense solutions to their challenges, where new big data technologies may be one of the options, but not the only one, and sometimes not the best. Rather than incorrectly assuming that big data is the panacea for all issues associated with poorly architected and deployed BI environments, BI pros should follow the guidelines in the Forrester recent report to decide whether their BI environment needs a healthy dose of upgrades and process improvements or whether it requires different big data technologies. Here are some of the findings and recommendations from the full research report:
Last year I published a reasonably well-received research document on Hadoop infrastructure, “Building the Foundations for Customer Insight: Hadoop Infrastructure Architecture”. Now, less than a year later it’s looking obsolete, not so much because it was wrong for traditional (and yes, it does seem funny to use a word like “traditional” to describe a technology that itself is still rapidly evolving and only in mainstream use for a handful of years) Hadoop, but because the universe of analytics technology and tools has been evolving at light-speed.
If your analytics are anchored by Hadoop and its underlying map reduce processing, then the mainstream architecture described in the document, that of clusters of servers each with their own compute and storage, may still be appropriate. On the other hand, if, like many enterprises, you are adding additional analysis tools such as NoSQL databases, SQL on Hadoop (Impala, Stinger, Vertica) and particularly Spark, an in-memory-based analytics technology that is well suited for real-time and streaming data, it may be necessary to begin reassessing the supporting infrastructure in order to build something that can continue to support Hadoop as well as cater to the differing access patterns of other tools sets. This need to rethink the underlying analytics plumbing was brought home by a recent demonstration by HP of a reference architecture for analytics, publicly referred to as the HP Big Data Reference Architecture.
At the China Hadoop Summit 2015 in Beijing this past weekend, I talked with various big data players, including large consumers of big data China Unicom, Baidu.com, JD.com, and Ctrip.com; Hadoop platform solution providers Hortonworks, RedHadoop, BeagleData, and Transwarp; infrastructure software vendors like Sequotia.com; and Agile BI software vendors like Yonghong Tech.
The summit was well-attended — organizers planned for 1,000 attendees and double that number attended — and from the presentations and conversations it’s clear that big data ecosystems are making substantial progress. Here are some of my key takeaways:
Telcos are focusing on optimizing internal operations with big data.Take China Unicom, one of China’s three major telcos, for example. China Unicom has completed a comprehensive business scenario analysis of related data across each segment of internal business operations, including business and operations support systems, Internet data centers, and networks (fixed, mobile, and broadband). It has built a Hadoop-based big data platform to process trillions of mobile access records every day within the mobile network to provide practical guidelines and progress monitoring on the construction of base stations.
By now you have at least seen the cute little elephant logo or you may have spent serious time with the basic components of Hadoop like HDFS, MapReduce, Hive, Pig and most recently YARN. But do you have a handle on Kafka, Rhino, Sentry, Impala, Oozie, Spark, Storm, Tez… Giraph? Do you need a Zookeeper? Apache has one of those too! For example, the latest version of Hortonworks Data Platform has over 20 Apache packages and reflects the chaos of the open source ecosystem. Cloudera, MapR, Pivotal, Microsoft and IBM all have their own products and open source additions while supporting various combinations of the Apache projects.
After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2104. For those that have day jobs that don’t include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework. We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and extended group of components that leverage but do not require it.
In the past, enterprise architects could afford to think big picture and that meant treating Hadoop as a single package of tools. Not any more – you need to understand the details to keep up in the age of the customer. Use our framework to help, but please read the report if you can as I include a lot more there.