Big data and Hadoop (Yellow Elephants) are so synonymous that it is easy to overlook the vast architectural landscape that goes into delivering big data value. Data scientists (Pink Unicorns) are likewise raised to god status as the only role that can truly harness the power of big data -- making insights from big data seem as far away as a manned journey to Mars. However, this week, as I participated in the DGIQ conference in San Diego and colleagues and friends attended the Hadoop Summit in Belgium, it became apparent that organizations are waking up to the fact that there is more to big data than a "cool" playground for the privileged few.
The perspective that the insight supply chain is the driver and catalyst of actions from big data is starting to take hold. Capital One, for example, illustrated that if insights from analytics and data in Hadoop are going to influence operational decisions and actions, you need the same degree of governance you established in traditional systems. In a conversation, Amit Satoor of SAP Global Marketing described a performance apparel company linking big data to operational and transactional systems at the edge of customer engagement, and noted that it had to be easy for application developers to implement.
Hadoop distribution, NoSQL, and analytic vendors need to step up the value proposition to be more than where the data sits and how sophisticated you can get with the analytics. In the end, if you can't govern quality, security, and privacy at the scale of edge end-user and customer engagement scenarios, those efforts to migrate data to Hadoop and the investment in analytic tools cost more than dollars; they cost you your business.
The business has an insatiable appetite for data and insights. Even in the age of big data, the number one issue for business stakeholders and analysts is getting access to the data. Once access is achieved, the next step is "wrangling" the data into a usable data set for analysis. The term "wrangling" itself creates a nervous twitch, unless you enjoy the rodeo. But the goal of the business isn't to be an adrenaline junkie. The goal is insight that helps it smartly navigate increasingly complex business landscapes and customer interactions. Those who get this have introduced a softer term, "blending." It is just another term dreamed up by data vendor marketers to avoid the dreaded conversation about data integration and data governance.
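Whatever the marketing label, "blending" in practice usually means joining records from different sources on a shared key and reconciling what doesn't match. A minimal sketch, with hypothetical CRM and web analytics records and field names, makes the point:

```python
# Hypothetical source data: a CRM extract and a web analytics extract
# that share a customer key. All names and fields are illustrative.
crm = [
    {"customer_id": 1, "name": "Acme Corp", "segment": "enterprise"},
    {"customer_id": 2, "name": "Beta LLC", "segment": "smb"},
]
web = [
    {"customer_id": 1, "visits": 42},
    {"customer_id": 3, "visits": 7},  # no matching CRM record
]

def blend(left, right, key):
    """Left-join two record lists on a key, flagging unmatched rows."""
    index = {row[key]: row for row in right}
    blended = []
    for row in left:
        match = index.get(row[key], {})
        merged = {**row, **{k: v for k, v in match.items() if k != key}}
        merged["matched"] = row[key] in index  # surface the gaps, don't hide them
        blended.append(merged)
    return blended

result = blend(crm, web, "customer_id")
```

The `matched` flag is the governance conversation vendors would rather avoid: every unmatched row is an integration or quality question someone still has to answer.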
The reality is that you can't market-message your way out of the fundamental problem: big data is creating data swamps, even with the best-intentioned efforts. (This is the reality of big data's first principle of schema-less data.) Data governance for big data is primarily relegated to cataloging data and its lineage, which serves the data management team but creates a new kind of nightmare for analysts and data scientists: working with a card catalog that rivals the Library of Congress. Dropping in a self-service business intelligence tool or advanced analytic solution doesn't solve the problem of familiarizing the analyst with the data. Analysts will still spend up to 80% of their time just trying to create the data set to draw insights from.
Early this year, a host of inquiries came in about data quality challenges in CRM systems. This led to a number of joint inquiries between myself and CRM expert Kate Leggett, VP and Principal Analyst on our application development and delivery team. It seems that the expectation that CRM systems could provide a single trusted view of the customer was starting to hit a reality check. There is more to it than collecting customer data and activities: you need validation, cleansing, standardization, consolidation, enrichment, and hierarchies. CRM applications only get you so far, even with more and more functionality being added to reduce duplicate records and enforce classifications and groupings. So, what should companies do?
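To make the chain above concrete, here is a minimal sketch of two of its steps, standardization and consolidation, applied to hypothetical customer records. The cleansing rules and field names are illustrative assumptions, not any vendor's implementation:

```python
import re

def standardize(record):
    """Standardize name casing and strip phone formatting (illustrative rules)."""
    return {
        "name": record["name"].strip().title(),
        "phone": re.sub(r"\D", "", record.get("phone", "")),
    }

def consolidate(records):
    """Consolidate duplicates: records sharing a normalized phone collapse
    to one surviving 'golden' record (first one wins, for simplicity)."""
    survivors = {}
    for rec in map(standardize, records):
        survivors.setdefault(rec["phone"], rec)
    return list(survivors.values())

raw = [
    {"name": "  jane doe ", "phone": "(555) 123-4567"},
    {"name": "JANE DOE", "phone": "555.123.4567"},   # duplicate of the first
    {"name": "John Smith", "phone": "555-987-6543"},
]
golden = consolidate(raw)
```

Even this toy version shows why CRM duplicate-merge features only get you so far: the survivorship rule ("first one wins") is a business decision, not a technical one, and that is exactly where governance has to step in.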
One of the biggest stumbling blocks is getting business resources to govern data. We've all heard it:
"I don't have time for this."
"Do you really need a full time person?"
"That really isn't my job."
"Isn't that an IT thing?"
"Can we just get a tool or hire a service company to fix the data?"
Let's face it: resources are the data governance killer, even as organizations try to take on enterprise-led data governance efforts.
What we need to do is rethink the data governance bottlenecks and start with the guiding principle that data can only be governed when you have the right culture throughout the organization. The point being, you need accountability from those who actually know something about the data, how it is used, and who feels the most pain. That's not IT, and that's not the data steward. It's the customer care representative, the sales executive, the claims processor, the assessor, the CFO, and so on. Not really the people you would normally include regularly in your data governance program. Heck, they are busy!
But the path to sustainable, effective data governance is data citizenship, where everyone is a data steward. So, we have to strike the right balance between automation, manual governance, and scale. This is even more important as our data and system ecosystems explode in size, sophistication, and speed. In the world of MDM and data quality, vendors are looking specifically at how to get around these challenges. There are five areas of innovation:
Spending time at the MDM/DG Summit in NYC this week demonstrated the wide spectrum of MDM implementations and stories out in the market. It certainly coincides with our upcoming MDM inquiry analysis, where:
An IT mindset has dominated the way organizations view and manage their data. Even as issues of quality and consistency rear their ugly heads, the solution has often been to turn to a tool and approach data governance in a project-oriented manner. Sustainability has been a challenge, often relegated to IT managing and updating data management tools (MDM, data quality, metadata management, information lifecycle management, and security). Forrester research has shown that less than 15% of organizations have business-led data governance that is linked to business initiatives, objectives, and outcomes. But this is changing. More and more organizations are looking toward data governance as a strategic enterprise competence as they adopt a data-driven culture.
This shift from project to strategic program requires more than basic workflow, collaboration, and data profiling capabilities to institutionalize data governance policies and rules. The conversation can't start with the data management technology (MDM, data quality, information lifecycle management, security, and metadata management) that will apply the policies and rules. It has to begin with what the organization is trying to achieve with its data; this is a strategy discussion and process. The implication: governing data requires a rethink of your operating model. New roles, responsibilities, and processes emerge.
The last Forrester Wave for MDM was released in 2008 and focused on the Customer Hub. Well, things have certainly changed since then. Organizations need enterprise scale to break down data silos. Data governance is quickly becoming part of an organization's operating model. And don't forget the big elephant in the room: Big Data.
From 2008 to now, there have been multiple analyst firm evaluations of MDM vendors. Vendors come, go, or are acquired. But the leaders are almost always the same. We also see inquiries and implementations tracking to the leaders. Our market overview report helped identify the distinct segments of MDM vendors and found that MDM leaders were going big: leveraging a strategic perspective of data management and a suite of products, and pushing to support and create modern data management environments. What remained to be addressed: how do you make a decision between these vendors?
The Forrester Wave for the Multi-Platform MDM market segment gets to the heart of this question by pushing top vendors to differentiate among themselves and evaluating them at the highest levels of MDM strategy. We learned things that surprised us, and we saw where the line was drawn between marketing messaging and positioning on the one hand and real capabilities on the other. We did this by structuring the Wave process the way our clients would evaluate vendors: rigorously questioning and fact-checking responses and demos.
For decades, firms have deployed applications and BI on independent databases and warehouses, supporting custom data models, scalability, and performance while speeding delivery. It’s become a nightmare to try to integrate the proliferation of data across these sources in order to deliver the unified view of business data required to support new business applications, analytics, and real-time insights. The explosion of new sources, driven by the triple-threat trends of mobile, social, and the cloud, amplified by partner data, market feeds, and machine-generated data, further aggravates the problem. Poorly integrated business data often leads to poor business decisions, reduces customer satisfaction and competitive advantage, and slows product innovation — ultimately limiting revenue.
Forrester’s latest research reveals how leading firms are coping with this explosion using data virtualization, leading us to release a major new version of our reference architecture, Information Fabric 3.0. Since Forrester invented the category of data virtualization eight years ago with the first version of information fabric, these solutions have continued to evolve. In this update, we reflect new business requirements and new technology options including big data, cloud, mobile, distributed in-memory caching, and dynamic services. Use information fabric 3.0 to inform and guide your data virtualization and integration strategy, especially where you require real-time data sharing, complex business transactions, more self-service access to data, integration of all types of data, and increased support for analytics and predictive analytics.
Information fabric 3.0 reflects significant innovation in data virtualization solutions, including:
Big data gurus have said that data quality isn’t important for big data. Good enough is good enough. However, business stakeholders still complain about poor data quality. In fact, when Forrester surveyed customer intelligence professionals, the ability to integrate data and manage data quality are the top two factors holding customer intelligence back.
So, do big data gurus have it wrong? Sort of . . .
I had the chance to attend and present at a marketing event put on by MITX last week in Boston that focused on data science for marketing and customer experience. I recommend all data and big data professionals do this. Here is why: how marketers and agencies talk about big data and data science differs from how IT talks about it. This isn't just a language barrier; it's a philosophy barrier. Let's look at this more closely:
Data is totals. When IT talks about data, it is talking about the physical elements stored in systems. When marketing talks about data, it is referring to the totals and calculation outputs from analysis.
Quality is completeness. At the MITX event, Panera Bread was asked how it understands customers who pay cash. This lack of data didn't hinder analysis. Panera looked at customers in its loyalty program and promotions who paid cash to make assumptions about this segment and its behavior. Analytics was the data quality tool that completed the customer picture.
Data rules are algorithms. When rules are applied to data, they are more aligned to segmentation and status that feed into personalized customer interactions. Data rules are not about transformation to marketers.
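The marketer's kind of "data rule" can be sketched in a few lines: not a transformation of stored values, but a classification that feeds personalization. The thresholds, field names, and segment labels below are illustrative assumptions:

```python
def segment(customer):
    """Classify a customer into a marketing segment for personalization.
    Thresholds and segment names are hypothetical examples."""
    big_spender = customer["annual_spend"] >= 10_000
    frequent = customer["visits_per_month"] >= 4
    if big_spender and frequent:
        return "vip"
    if frequent:
        return "loyal"
    if big_spender:
        return "big-ticket"
    return "occasional"
```

Note what this rule does not do: it never corrects or reformats the underlying fields. To IT that would be a data quality rule; to marketing, the algorithm's output is the data.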
Sometimes getting the data quality right is just hard, if not impossible. Even after implementing data quality tools, acquiring third-party data feeds, and implementing data steward remediation processes, often the business is still not satisfied with the quality of the data. Data is still missing and considered old or irrelevant. For example:
Insurance companies want access to construction data to improve catastrophe modeling.
Food chains need to incorporate drop-off bays and instructions for outlets in shopping malls and plazas to get food supplies to the prep tables.
Global companies need to validate address information in developing countries that have incomplete or fast-changing postal directories for logistics.
What it takes to complete the data and improve it has now entered the realm of hands-on processes.
CrowdFlower says it has the answer to the data challenges listed above. It combines a crowdsourcing model with a data stewardship platform to manage the last mile in data quality. The crowd is a vast network of people around the globe who are notified of data quality tasks through the data stewardship platform. If they can help with the data quality need within the requested time period, contributors accept the task and get to work. The crowd can use all the resources and channels available to them to complete tasks, such as web searches, site visits, and phone inquiries. Quality control is performed to validate the crowdsourced data and improvements. As an organization submits more data quality tasks, machine learning is applied to analyze and optimize crowdsourcing based on the scores and results of contributors.
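The quality control and contributor scoring described above can be sketched simply. This is not CrowdFlower's implementation, just a common pattern for validating crowdsourced answers: have several contributors answer each task, resolve by majority vote, and score each contributor by how often they agree with the consensus:

```python
from collections import Counter, defaultdict

def resolve_tasks(answers):
    """Resolve crowdsourced answers by majority vote and score contributors.
    answers: list of (task_id, contributor, value) tuples. Illustrative only."""
    by_task = defaultdict(list)
    for task, who, value in answers:
        by_task[task].append((who, value))
    resolved, tallies = {}, defaultdict(lambda: [0, 0])
    for task, votes in by_task.items():
        winner, _ = Counter(v for _, v in votes).most_common(1)[0]
        resolved[task] = winner
        for who, value in votes:
            tallies[who][0] += value == winner  # agreements with consensus
            tallies[who][1] += 1                # total answers given
    accuracy = {who: ok / total for who, (ok, total) in tallies.items()}
    return resolved, accuracy

# Three contributors verify a business address (hypothetical task):
answers = [
    ("t1", "ana", "123 Main St"),
    ("t1", "ben", "123 Main St"),
    ("t1", "cy", "125 Main St"),
]
resolved, accuracy = resolve_tasks(answers)
```

The accuracy scores are the raw material for the machine learning step: contributors who consistently agree with the consensus can be routed harder tasks, while low scorers trigger extra validation.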