There's certainly a lot of hype out there about big data. As I previously wrote, some of it is indeed hype, but there are still many legitimate big data cases - I saw a great example during my last business trip. Hadoop certainly plays a key role in the big data revolution, so all business intelligence (BI) vendors are jumping on the bandwagon and saying that they integrate with Hadoop. But what does that really mean? First of all, Hadoop is not a single entity; it's a conglomeration of multiple projects, each addressing a certain niche within the Hadoop ecosystem, such as data access, data integration, DBMS, system management, reporting, analytics, data exploration, and much, much more. To lift the veil of hype, I recommend that you ask your BI vendors the following questions:
Which specific Hadoop projects do you integrate with (HDFS, Hive, HBase, Pig, Sqoop, and many others)?
Do you work with the community edition software or with commercial distributions from MapR, EMC/Greenplum, Hortonworks, or Cloudera? Have these vendors certified your Hadoop implementations?
Do you have tools and utilities to help the client get data into Hadoop in the first place (see comment from Birst)?
Are you querying Hadoop data directly from your BI tools (reports, dashboards) or are you ingesting Hadoop data into your own DBMS? If the latter:
Are you selecting Hadoop result sets using Hive?
Are you ingesting Hadoop data using Sqoop?
Is your ETL generating and pushing down MapReduce jobs to Hadoop? Are you generating Pig scripts?
So, this blog is dedicated to stepping outside the comfort zone once again and into the world of chaos. Not only may you not want to persist your data quality transformations, but you may not want to cleanse the data at all.
Current thinking: Purge poor data from your environment. Put the word “risk” in the same sentence as data quality and watch the hackles go up on data quality professionals. It is like using salt in your coffee instead of sugar. However, the biggest challenge I see many data quality professionals face is getting lost in all the data because they need to remove the risk that bad data poses to the business. In the world of big data, clearly you are not going to be able to cleanse all that data. A best practice is to identify critical data elements that have the most impact on the business and focus efforts there. Problem solved.
Not so fast. Even scoping the data quality effort may not be the right way to go. The time and effort it takes as well as the accessibility of the data may not meet business needs to get information quickly. The business has decided to take the risk, focusing on direction rather than precision.
I recently had both the privilege and pleasure to do a deep dive into the cold and warm BI waters in Russia and Israel. Cold - because some of my experiences were sobering. Warm - because the reception could not have been more pleasant. My presentations were well attended (sponsored by www.in4media.ru in Russia and www.matrix.co.il in Israel), showing high levels of BI interest, adoption, experience, and expertise. Challenges remain the same, as Russian and Israeli businesses struggle with BI governance, ownership, SDLC and PMO methodologies, data, and app integration just like the rest of the world. I spent long evening hours with a large global company in Israel that grew rapidly by M&A and is struggling with multiple strategic challenges: centralize or localize BI, vendor selection, end user empowerment, etc. Sound familiar?
But it was not all business as usual. A few interesting regional peculiarities did come out. For example, the "BI as a key competitive differentiator" message fell on mostly deaf ears in Russia, as Russian companies don't really compete against each other. Territories, brands, markets, and spheres of influence are handed top down from the government or negotiated in high-level deals behind closed doors. That is not to say, however, that BI in Russia is only used for reporting - multiple businesses are pushing BI to the limits with uses such as advanced customer segmentation for better upsell/cross-sell rates.
I was also pleasantly surprised and impressed a few times (and for those of you who know me well, you know that it's pretty hard to impress the old veteran):
We last spoke about how to reboot our thinking on master data to provide a more flexible and useful structure when working with big data. In the structured data world, having a model to work from provides comfort. However, there is an element of comfort and control that has to be given up with big data, and that is our definition and the underlying premise for data quality.
Current thinking: Persistence of cleansed data. For years data quality efforts have focused on finding and correcting bad data. We used the word “cleansing” to represent the removal of what we didn’t want, exterminating it like it was an infestation of bugs or rats. Knowing what your data is, what it should look like, and how to transform it into submission defined the data quality handbook. Whole practices were stood up to track data quality issues and to establish workflows and teams to clean the data, and then reports were produced to show what was done. Accomplishment was measured by progress on, and maintenance of, metrics such as the number of duplicates, complete records, last update, and conformance to standards. Our reports may also be tied to our personal goals. Now comes big data: how do we cleanse and tame that beast?
Reboot: Disposability of data quality transformation. The answer to the above question is, maybe you don’t. The nature of big data doesn’t lend itself to traditional data quality practices. The volume may be too large for processing. The volatility and velocity of data change too frequently to manage. The variety of data, both in scale and visibility, is ambiguous.
What data do you trust? Increasingly, business stakeholders and data scientists trust the information hidden in the bowels of big data. Yet the way data is mined mostly circumvents existing data governance and data architecture, because of the speed of insight required and the need to support data discovery over repeatable reporting.
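To make the traditional scorecard concrete, here is a minimal sketch in Python of the kind of report those practices produced: duplicates, complete records, and conformance to a standard. The field names, the sample records, and the simple email pattern are all assumptions made up for illustration, not anything from a real client or tool.

```python
import re
from collections import Counter

# Hypothetical conformance rule: a deliberately simple email shape check.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def quality_report(records, key_field, required_fields, pattern_field, pattern):
    """Count duplicates, complete records, and pattern-conforming records."""
    total = len(records)

    # A record is a duplicate if its key was already seen on another record.
    key_counts = Counter(r.get(key_field) for r in records)
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)

    # A record is complete if every required field is present and non-empty.
    complete = sum(
        1
        for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )

    # A record conforms if the chosen field matches the pattern.
    conforming = sum(
        1 for r in records if pattern.match(r.get(pattern_field) or "")
    )

    return {
        "total": total,
        "duplicates": duplicates,
        "complete": complete,
        "conforming": conforming,
        "complete_ratio": complete / total if total else 0.0,
    }


# Illustrative sample data (made up).
records = [
    {"customer_id": "C1", "name": "Ann", "email": "ann@example.com"},
    {"customer_id": "C2", "name": "", "email": "bob@example"},
    {"customer_id": "C1", "name": "Ann", "email": "ann@example.com"},
    {"customer_id": "C3", "name": "Cy", "email": None},
]

report = quality_report(
    records,
    key_field="customer_id",
    required_fields=["name", "email"],
    pattern_field="email",
    pattern=EMAIL_PATTERN,
)
print(report)
```

A report like this is cheap on a few thousand customer records; the open question is whether it stays feasible, or even meaningful, at big data volume and velocity.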
The key to this challenge is a data quality reboot: rethink what matters, and rethink data governance.
Part 1 of our Data Quality Reboot Series is to rethink master data management (MDM) in a big data world.
Current thinking: Master data as a single data entity. A common theme I hear from clients is that master data is about the linked data elements for a single record. No duplication or variation exists to drive consistency and uniqueness. Master data in the current thinking represents a defined, named entity (customer, supplier, product, etc.). This is a very static view of master data and does not account for the various dimensions required for what is important within a particular use case. We typically see this approach tied tightly to an application (customer relationship management, enterprise resource planning) for a particular business unit (marketing, finance, product management, etc.). It may have been the entry point for MDM initiatives, and it allowed for smaller-scope, tangible wins. But it is difficult to expand that master data to other processes, analysis, and distribution points. Master data as a static entity only takes you so far, regardless of whether big data is incorporated into the discussion or not.
Cloud Services Offer New Opportunities For Big Data Solutions
What’s better than writing about one hot topic? Well, writing about two hot topics in one blog post — and here you go:
The State Of BI In The Cloud
Over the past few years, business intelligence (BI) was the overlooked stepchild of cloud solutions and market adoption. Sure, some BI software-as-a-service (SaaS) vendors have been pretty successful in this space, but it was success in a niche compared with the four main SaaS applications: customer relationship management (CRM), collaboration, human capital management (HCM), and eProcurement. While those four applications each reached cloud adoption of 25% and more in North America and Western Europe, BI was leading the field of second-tier SaaS solutions, used by 17% of all companies in our Forrester Software Survey, Q4 2011. Considering that the main challenges of cloud computing are data security and integration efforts (yes, the story of simply swiping your credit card to get a fully operational cloud solution in place is a fairy tale), 17% cloud adoption is actually not bad at all; BI is all about data integration, data analysis, and security. With BI there is of course the flexibility to choose which data a company considers running in a cloud deployment and which data sources to integrate, a choice that is very limited when implementing, e.g., a CRM or eProcurement cloud solution.
“38% of all companies are planning a BI SaaS project before the end of 2013.”
I’ll be chairing Big Data World Europe on September 19 in London; in advance of that event, here are a few thoughts.
Since late 2011, we’ve seen the big data noise level eclipse cloud and even BYOD, and we are seeing the backlash too (see Death By Big Data, to which I tweeted, “Yes, I suppose, ‘too much of anything is a bad thing’”). The number one thing clients want to know is, “What is my competition doing? Give me examples I can talk to my business about.” These questions reflect a curiosity on the part of IT and a “peeking under the hood to see what’s there” attitude.
My advice is to start the big data journey with your feet on the ground and your head around what it really is. Here are some “rules” I’ve been using with folks I talk to:
● First rule of big data: don’t talk about big data. The old adage holds true here — those that can do big data do it, those that can’t talk <yup, I see the irony :-)>. I was on the phone with a VP of analytics who reflected that her IT people were constantly bringing new technologies to them like a dog with a bone. Her general reaction is, show me the bottom-line value. So what to do? Instead of talking to your business about big data, find ways to solve problems more affordably with data at greater scale. Now that’s “doing big data.”
As the new analyst on the block at Forrester, the first question everyone is asking is, “What research do you have planned?” Just to show that I’m up for the task, rather than keeping it simple with a thoughtful report on data quality best practices or a maturity assessment on data management, I thought I’d go for broke and dive into the master data management (MDM) landscape. Some might call me crazy, but this is more than just the adrenaline rush that comes from doing such a project. In over 20 inquiries with clients in the past month, questions show increased sophistication in how managing master data can strategically contribute to the business.
What do I mean by this?
Number 1: Clients want to know how to bring together transactional data (structured) and content (semi-structured and unstructured) to understand the customer experience, improve customer engagement, and maximize the value of the customer. Understanding customer touch points across social media, e-commerce, customer service, and content consumption provides a single customer view that lets you customize your interactions and be highly relevant to your customer. MDM is at the heart of bringing this view together.
Number 2: Clients have begun to analyze big data within side projects as a way to identify opportunities for the business. This intelligence has reached the point that clients are now exploring how to distribute and operationalize these insights throughout the organization. MDM is the point that will align discoveries within the governance of master data for context and use.
I love predictive analytics. I mean, who wouldn't want to develop an application that could help you make smart business decisions, sell more stuff, make customers happy, and avert disasters? Predictive analytics can do all that, but it is not easy. In fact, it can range from being impossible to hard depending on:
Causative data. The lifeblood of predictive analytics is data. Data can come from internal systems such as customer transactions or manufacturing defect data. It is often appropriate to include data from external sources such as industry market data, social networks, or statistics. Contrary to popular technology beliefs, it does not always need to be big data. It is far more important that the data contain variables that can be used to predict an effect. Having said that, the more data you have, the better chance you have of finding cause and effect. Still, big data is no guarantee of success.
Earlier this week Dell joined arch-competitor HP in endorsing ARM as a potential platform for scale-out workloads by announcing “Copper,” an ARM-based version of its PowerEdge-C dense server product line. Dell’s announcement and positioning, while a little less high-profile than HP’s February announcement, is intended to serve the same purpose — to enable an ARM ecosystem by providing a platform for exploring ARM workloads and to gain a visible presence in the event that it begins to take off.
Dell’s platform is based on a four-core Marvell ARM V7 SOC implementation, which it claims is somewhat higher performance than the Calxeda part, although drawing more power, at 15W per node (including RAM and local disk). The server uses the PowerEdge-C form factor of 12 vertically mounted server modules in a 3U enclosure, each with four server nodes on them for a total of 48 servers/192 cores in a 3U enclosure. In a departure from other PowerEdge-C products, the Copper server has integrated L2 network connectivity spanning all servers, so that the unit will be able to serve as a low-cost test bed for clustered applications without external switches.
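The density figures above are easy to check with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers quoted in this post; the function name is made up for illustration, and the wattage total is an assumption that simply multiplies the quoted 15W per node, ignoring fans, the integrated switch, and power supply losses.

```python
# Figures as quoted in the post for the "Copper" enclosure.
MODULES_PER_ENCLOSURE = 12  # vertically mounted server modules
NODES_PER_MODULE = 4        # server nodes per module
CORES_PER_NODE = 4          # four-core Marvell SOC per node
WATTS_PER_NODE = 15         # per node, including RAM and local disk
RACK_UNITS = 3              # enclosure height in U


def enclosure_totals(
    modules=MODULES_PER_ENCLOSURE,
    nodes_per_module=NODES_PER_MODULE,
    cores_per_node=CORES_PER_NODE,
    watts_per_node=WATTS_PER_NODE,
    rack_units=RACK_UNITS,
):
    """Derive per-enclosure totals from the per-node figures."""
    nodes = modules * nodes_per_module
    cores = nodes * cores_per_node
    return {
        "nodes": nodes,
        "cores": cores,
        # Node power only; chassis overhead is not included (assumption).
        "node_watts": nodes * watts_per_node,
        "cores_per_rack_unit": cores / rack_units,
    }


totals = enclosure_totals()
print(totals)
```

The node and core counts match the 48 servers and 192 cores stated above; the derived power and cores-per-U figures are rough estimates, not Dell specifications.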
Dell is offering this server to selected customers, not as a GA product, along with open source versions of the LAMP stack, Crowbar, and Hadoop. Currently Canonical is supplying Ubuntu for ARM servers, and Dell is actively working with other partners. Dell expects to see OpenStack available for demos in May, and there is an active Fedora project underway as well.