If you think you can do big data in-house, get ready for a lot of disappointment. If the data you want to analyze runs to terabytes, comes from multiple sources (streaming in from customers, devices or sensors) and the insights you need are more complex than basic trending, you are probably looking for a data scientist or two. You probably also have an open job requisition for a Hadoop expert and have hit the limit on what your capital budget will let you buy to house all this data and the insights drawn from it. Thus you are likely taking a hard look at some cloud-based options to fill your short-term needs.
By now you have at least seen the cute little elephant logo or you may have spent serious time with the basic components of Hadoop like HDFS, MapReduce, Hive, Pig and most recently YARN. But do you have a handle on Kafka, Rhino, Sentry, Impala, Oozie, Spark, Storm, Tez… Giraph? Do you need a Zookeeper? Apache has one of those too! For example, the latest version of Hortonworks Data Platform has over 20 Apache packages and reflects the chaos of the open source ecosystem. Cloudera, MapR, Pivotal, Microsoft and IBM all have their own products and open source additions while supporting various combinations of the Apache projects.
After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014. For those whose day jobs don’t include constantly tracking Hadoop’s evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework. We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop file system (HDFS) and an extended group of components that leverage but do not require it.
In the past, enterprise architects could afford to think big picture and that meant treating Hadoop as a single package of tools. Not any more – you need to understand the details to keep up in the age of the customer. Use our framework to help, but please read the report if you can as I include a lot more there.
At its Paris summit, the OpenStack community celebrated the 10th release of what has become the leading open source Infrastructure as a Service cloud platform software. What stood out about this latest iteration and the progress of its ever-growing ecosystem of vendors, users and service providers was the lack of excitement that comes with maturity. The Juno release addressed many challenges holding back enterprise adoption to this point and showed signs that 2015 may prove to be the year its use shifts over from mostly test & dev, to mostly production. Forrester clients will find a new Quick Take on OpenStack that analyzes the state of this platform and recommended actions here. In this blog post we look at looming questions facing the OpenStack community that could affect the pace and direction of its innovation.
Hadoop adoption and innovation are moving forward at a fast pace, playing a critical role in today's data economy. But how fast and how far will Hadoop go heading into 2015?
Prediction 1: Hadooponomics makes enterprise adoption mandatory. The jury is in. Hadoop has been found not guilty of being an over-hyped open source platform. Hadoop has proven real enterprise value in any number of use cases including data lakes, traditional and advanced analytics, ETL-less ETL, active-archive, and even some transactional applications. All these use cases are powered by what Forrester calls “Hadooponomics” — its ability to linearly scale both data storage and data processing.
What it means: The remaining minority of dazed and confused CIOs will make Hadoop a priority for 2015.
The data economy — or the system that provides for the exchange of digitized information for the purpose of creating insights and value — grew in 2014, but in 2015 we’ll see it leap forward significantly. It will grow from a phenomenon that mainstream enterprises view at arm’s length as interesting to one that they embrace as a part of business as usual. The number of business and technology leaders telling us that external data is important to their business strategy has been growing rapidly -- from one-third in 2012 to almost half in 2014.
Why? It’s a supply-driven phenomenon made possible by widespread digitization, mobile technology, the Internet of Things (IoT), and Hadooponomics. With countless new data sources and powerful new tools to wrest insights from their depths, organizations will scramble to use them to know their customers better and to optimize their operations beyond anything they could have done before. And while the exploding data supply will spur demand, it will also spur additional supply. Firms will be taking a hard look at their “data exhaust” and wondering if there is a market for new products and services based on their unique set of data. But in many cases, the value of the data lies not in bulk downloads or access to raw data that people will pay for, but in data products that complement a firm’s existing offerings.
But Avoid Ending Up With A Zoo Of Individual Big Data Solutions
We are beyond the point of struggling over the definition of big data. That doesn’t mean that we've resolved all of the confusion that surrounds the term, but companies today are instead struggling with the question of how to actually get started with big data.
28% of all companies are planning a big data project in 2014.
According to Forrester's Business Technographics™ Global Data And Analytics Survey, 2014, 28% of the more than 1,600 responding companies globally are planning a big data project this year. More details, including how this splits between IT-driven and business-driven projects, can be found in our new Forrester report ‘Reset On Big Data’.
Or join our Forrester Forum For Technology Leaders in London on June 12 and 13, 2014, to hear and discuss with us directly what big data projects your peers are planning, what challenges they are facing and what goals they aim to achieve.
This week, IBM announced its new line of x86 servers, and included among the usual incremental product improvements is a performance game-changer called eXFlash. eXFlash is the first commercially available implementation of the MCS architecture announced last year by Diablo Technologies. The MCS architecture, and IBM’s eXFlash offering in particular, allows flash memory to be embedded on the system as close to the CPU as main memory, with latencies substantially lower than any other available flash option, offering better performance at a lower solution cost than other embedded flash solutions. Key aspects of the announcement include:
■ Flash DIMMs offer scalable high performance. Write latency (a critical metric) for IBM eXFlash will be in the 5 to 10 microsecond range, whereas best-of-breed competing mezzanine card and PCIe flash can only offer 15 to 20 microseconds (and external flash storage is slower still). Additionally, since the DIMMs are directly attached to the memory controller, flash I/O does not compete with other I/O on the system I/O hub and PCIe subsystem, improving overall system performance for heavily loaded systems. Additional benefits include linear performance scalability as the number of DIMMs increases and optional built-in hardware mirroring of DIMM pairs.
■ eXFlash DIMMs are compatible with current software. Part of the magic of MCS flash is that it appears to the OS as a standard block-mode device, so all existing block-mode software will work, including applications, caching and tiering or general storage management software. For IBM users, compatibility with IBM’s storage management and FlashCache Storage Accelerator solutions is guaranteed. Other vendors will face zero to low effort in qualifying their solutions.
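To put the write-latency figures quoted above in perspective, the short Python sketch below works out the range of speedups they imply. It uses only the two quoted ranges (5 to 10 microseconds for eXFlash, 15 to 20 microseconds for competing mezzanine card and PCIe flash). The function and constant names are made up for illustration, and the result is arithmetic on announced numbers, not a measurement.

```python
# Quoted write-latency ranges in microseconds (vendor-announced, not measured here).
EXFLASH_WRITE_US = (5, 10)
COMPETING_WRITE_US = (15, 20)


def speedup_range(fast, slow):
    """Return the (smallest, largest) latency ratio implied by two (low, high) ranges."""
    fast_low, fast_high = fast
    slow_low, slow_high = slow
    # Smallest advantage: the fast device at its slowest vs. the slow device at its fastest.
    smallest = slow_low / fast_high
    # Largest advantage: the fast device at its fastest vs. the slow device at its slowest.
    largest = slow_high / fast_low
    return smallest, largest


if __name__ == "__main__":
    smallest, largest = speedup_range(EXFLASH_WRITE_US, COMPETING_WRITE_US)
    print(f"Implied write-latency advantage: {smallest:.1f}x to {largest:.1f}x")
```

On the quoted figures, the advantage falls somewhere between 1.5x and 4x, depending on where in each range a given workload lands.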
Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data. Forrester believes that Hadoop will become must-have infrastructure for large enterprises. If you have lots of data, there is a sweet spot for Hadoop in your organization. Here are five reasons firms should adopt Hadoop today:
Build a data lake with the Hadoop file system (HDFS). Firms leave potentially valuable data on the cutting-room floor. A core component of Hadoop is its distributed file system, which can store both huge files and huge numbers of files, and which scales linearly across three, 10, or 1,000 commodity nodes. Firms can use Hadoop data lakes to break down data silos across the enterprise and commingle data from CRM, ERP, clickstreams, system logs, mobile GPS, and just about any other structured or unstructured data that might contain previously undiscovered insights. Why limit yourself to wading in multiple kiddie pools when you can dive for treasure chests at the bottom of the data lake?
Enjoy cheap, quick processing with MapReduce. You’ve poured all of your data into the lake, and now you have to process it. Hadoop MapReduce is a distributed data processing framework that brings the processing to the data and runs it in a highly parallel fashion. Instead of serially reading data from files, MapReduce pushes the processing out to the individual Hadoop nodes where the data resides. The result: Large amounts of data can be processed in parallel in minutes or hours rather than in days. Now you know why Hadoop’s origins stem from monstrous data processing use cases at Google and Yahoo.
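For readers who have not seen the MapReduce pattern up close, here is a minimal sketch of the idea in plain Python. It is not Hadoop code and calls no Hadoop API: the three "nodes" are just in-memory lists, the sample text is invented, and the phases run one after another in a single process. What it illustrates is the division of labor described above, where each node maps over its own local data and only the small intermediate results are grouped and reduced.

```python
from collections import defaultdict

# Hypothetical sample data: each inner list stands for the lines stored on one node.
NODE_PARTITIONS = [
    ["big data big insights"],
    ["data lake data"],
    ["big lake"],
]


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in one node's local lines."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle step: group the emitted values by key across all nodes."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(partitions):
    # In a real cluster each map_phase call would run on the node holding that
    # partition; here the calls simply run sequentially in one process.
    mapped = [map_phase(partition) for partition in partitions]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(NODE_PARTITIONS))
```

Because every map call touches only its own partition, adding nodes adds map capacity without any one machine having to read the whole data set. That is the property the minutes-versus-days claim rests on.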
Yesterday Intel had a major press and analyst event in San Francisco to talk about their vision for the future of the data center, anchored on what has become in many eyes the virtuous cycle of future infrastructure demand – mobile devices and “the Internet of things” driving cloud resource consumption, which in turn spews out big data which spawns storage and the requirement for yet more computing to analyze it. As usual with these kinds of events from Intel, it was long on serious vision, and strong on strategic positioning but a bit parsimonious on actual future product information with a couple of interesting exceptions.
Content and Core Topics:
No major surprises on the underlying demand-side drivers. The proliferation of mobile devices, the impending Internet of Things and the mountains of big data that they generate will combine to continue to increase demand for cloud-resident infrastructure, particularly servers and storage, both of which present Intel with an opportunity to sell semiconductors. Needless to say, Intel laced their presentations with frequent reminders about who was the king of semiconductor manufacturing.