Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data. Forrester believes that Hadoop will become must-have infrastructure for large enterprises. If you have lots of data, there is a sweet spot for Hadoop in your organization. Here are five reasons firms should adopt Hadoop today:
Build a data lake with the Hadoop file system (HDFS). Firms leave potentially valuable data on the cutting-room floor. A core component of Hadoop is its distributed file system, which can store huge files and many files to scale linearly across three, 10, or 1,000 commodity nodes. Firms can use Hadoop data lakes to break down data silos across the enterprise and commingle data from CRM, ERP, clickstreams, system logs, mobile GPS, and just about any other structured or unstructured data that might contain previously undiscovered insights. Why limit yourself to wading in multiple kiddie pools when you can dive for treasure chests at the bottom of the data lake?
Enjoy cheap, quick processing with MapReduce. You’ve poured all of your data into the lake — now you have to process it. Hadoop MapReduce is a distributed data processing framework that brings the processing to the data in a highly parallel fashion to process and analyze data. Instead of serially reading data from files, MapReduce pushes the processing out to the individual Hadoop nodes where the data resides. The result: Large amounts of data can be processed in parallel in minutes or hours rather than in days. Now you know why Hadoop’s origins stem from monstrous data processing use cases at Google and Yahoo.
Yesterday Intel had a major press and analyst event in San Francisco to talk about their vision for the future of the data center, anchored on what has become in many eyes the virtuous cycle of future infrastructure demand – mobile devices and “the Internet of things” driving cloud resource consumption, which in turn spews out big data which spawns storage and the requirement for yet more computing to analyze it. As usual with these kinds of events from Intel, it was long on serious vision, and strong on strategic positioning but a bit parsimonious on actual future product information with a couple of interesting exceptions.
Content and Core Topics:
No major surprises on the underlying demand-side drivers. The the proliferation of mobile device, the impending Internet of Things and the mountains of big data that they generate will combine to continue to increase demand for cloud-resident infrastructure, particularly servers and storage, both of which present Intel with an opportunity to sell semiconductors. Needless to say, Intel laced their presentations with frequent reminders about who was the king of semiconductor manufacturingJ
Reflections from the 10th Safer Internet Day Conference in Berlin, February 5th 2013
Earlier this month, I had the pleasure of speaking at the Safer Internet Day Conference in Berlin, organized by the Federal Ministry of Consumer Protection, Food and Agriculture and BITKOM, the German Association for Information Technology, Telecommunication and New Media. The conference title, ‘Big Data – Gold Mine or Dynamite?’ set the scene; after my little introductory speech on what big data really means and why this is a relevant topic for all of us (industry, consumers, and government), the follow-up presentations pretty much focused either on the ‘gold mine’ or the ‘dynamite’ aspect. To come straight to the point: I was very surprised, if not slightly shocked at how deep a gap became visible between the industry on the one side and the government (mainly the data protection authorities) on the other side.
While industry representatives, spearheaded by the BITKOM president Prof. Dieter Kempf and speakers from IBM, IMS Health, SAS, and others, highlighted interesting showcases and future opportunities for big data, Peter Schaar, the Federal Commissioner for Data Protection, seemed to be on a crusade to protect ‘innocent citizens’ from the ‘baddies’ in the industry.
It looks that EMC has finally admitted it needs a better approach for courting developers and is doing something significant to fix this. No longer will key assets like Greenplum, Pivotal, or Spring flounder in a corporate culture dominated by infrastructure thinking and selling.
After months of rumors about a possible spin-out going unaddressed, EMC pulled the trigger today, asking Terry Anderson, its VP of Corporate Communications, to put out an official acknowledgement on one of it its blogs (a stealthy, investor-relations-centric move) of its plans to aggregate its cloud and big data assets and give them concentrated focus. It didn't officially announce a spin out or even the creation of a new division. Nor did it clearly identify the role former VMware CEO Paul Maritz will play in this new gathering. But it did clarify what assets would be pushed into this new group:
There's certainly a lot of hype out there about big data. As I previously wrote, some of it is indeed hype, but there are still many legitimate big data cases - I saw a great example during my last business trip. Hadoop certainly plays a key role in the big data revolution, so all business intelligence (BI) vendors are jumping on the bandwagon and saying that they integrate with Hadoop. But what does that really mean? First of all, Hadoop is not a single entity; it's a conglomeration of multiple projects, each addressing a certain niche within the Hadoop ecosystem, such as data access, data integration, DBMS, system management, reporting, analytics, data exploration, and much much more. To lift the veil of hype, I recommend that you ask your BI vendors the following questions
Which specific Hadoop projects do you integrate with (HDFS, Hive, HBase, Pig, Sqoop, and many others)?
Do you work with the community edition software or with commercial distributions from MapR, EMC/Greenplum, Hortonworks, or Cloudera? Have these vendors certified your Hadoop implementations?
Do you have tools, utilities to help the client data into Hadoop in the first place (see comment from Birst)?
Are you querying Hadoop data directly from your BI tools (reports, dashboards) or are you ingesting Hadoop data into your own DBMS? If the latter:
Are you selecting Hadoop result sets using Hive?
Are you ingesting Hadoop data using Sqoop?
Is your ETL generating and pushing down Map Reduce jobs to Hadoop? Are you generating Pig scripts?
If you think the term "Big Data" is wishy washy waste, then you are not alone. Many struggle to find a definition of Big Data that is anything more than awe-inspiring hugeness. But Big Data is real if you have an actionable definition that you can use to answer the question: "Does my organization have Big Data?" Proposed is a definition that takes into account both the measure of data and the activities performed with the data. Be sure to scroll down to calculate your Big Data Score.
Big Data Can Be Measured
Big Data exhibits extremity across one or many of these three alliterate measures:
Emerging ARM server Calxeda has been hinting for some time that they had a significant partnership announcement in the works, and while we didn’t necessarily not believe them, we hear a lot of claims from startups telling us to “stay tuned” for something big. Sometimes they pan out, sometimes they simply go away. But this morning Calxeda surpassed our expectations by unveiling just one major systems partner – but it just happens to be Hewlett Packard, which dominates the WW market for x86 servers.
At its core (unintended but not bad pun), the HP Hyperscale business unit Project Moonshot and Calxeda’s server technology are about improving the efficiency of web and cloud workloads, and promises improvements in excess of 90% in power efficiency and similar improvements in physical density compared with current x86 solutions. As I noted in my first post on ARM servers and other documents, even if these estimates turn out to be exaggerated, there is still a generous window within which to do much, much, better than current technologies. And workloads (such as memcache, Hadoop, static web servers) will be selected for their fit to this new platform, so the workloads that run on these new platforms will potentially come close to the cases quoted by HP and Calxeda.