Sometimes getting data quality right is just hard, if not impossible. Even after deploying data quality tools, acquiring third-party data feeds, and standing up data steward remediation processes, the business is often still not satisfied with the quality of the data. Data is still missing, old, or irrelevant. For example:
Insurance companies want access to construction data to improve catastrophe modeling.
Food chains need drop-off bay locations and delivery instructions for outlets in shopping malls and plazas to get food supplies to the prep tables.
Global companies need to validate address information for logistics in developing countries whose postal directories are incomplete or fast-changing.
Completing and improving this data has now entered the realm of hands-on processes.
CrowdFlower says it has the answer to the data challenges listed above. Its model combines crowdsourcing with a data stewardship platform to manage the last mile in data quality. The crowd is a vast network of people around the globe who are notified of data quality tasks through the data stewardship platform. If a contributor can help with the data quality need within the requested time period, he or she accepts the task and gets to work. The crowd can use any resources and channels available to complete tasks: web searches, site visits, and phone inquiries. Quality control is performed to validate crowdsourced data and improvements. As an organization submits more data quality tasks, machine learning analyzes contributor scores and results to optimize how the work is crowdsourced.
I had a conversation recently with Brian Lent, founder, chairman, and CTO of Medio. If you don’t know Brian, he has worked with companies such as Google and Amazon to build and hone their algorithms and is currently taking predictive analytics to mobile engagement. The perspective he brings as a data scientist not only has ramifications for big data analytics, but drastically shifts the paradigm for how we architect our master data and ensure quality.
We discussed big data analytics in the context of behavior and engagement. Think shopping carts and search. At the core, analytics is about the “closed loop.” It is, as Brian says, a rinse and repeat cycle. You gain insight for relevant engagement with a customer, you engage, then you take the results of that engagement and put them back into the analysis.
Sounds simple, but think about what that means for data management. Brian provided two principles:
I recently had a client ask about MDM measurement for their customer master. In many cases, the discussions I have about measurement are about how to show that MDM has "solved world hunger" for the organization. In fact, a lot of the research and content out there focuses on just that. Great for creating a business case for investment. Not so good for the daily management of master data and data governance. This client's question was more practical, touching upon:
What about the data do you measure?
How do you calculate it?
How frequently do you report and show trends?
How do you link the calculation to something the business understands?
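To make the first two questions concrete, here is a minimal sketch of two common customer-master metrics: field completeness and duplicate rate. The records and field names are invented for illustration; a real implementation would run against the actual master data store.

```python
# Hypothetical customer-master records; field names are illustrative only.
records = [
    {"id": 1, "name": "Acme Corp", "email": "info@acme.com", "phone": None},
    {"id": 2, "name": "Acme Corp", "email": "info@acme.com", "phone": "555-0100"},
    {"id": 3, "name": "Globex",    "email": None,            "phone": "555-0199"},
]

def completeness(records, field):
    """Share of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def duplicate_rate(records, key_fields):
    """Share of records whose key fields collide with an earlier record."""
    seen, dupes = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes / len(records)

print(round(completeness(records, "phone"), 2))                 # 2 of 3 filled
print(round(duplicate_rate(records, ["name", "email"]), 2))     # 1 of 3 collides
```

Reporting these numbers on a recurring schedule, broken down by the business process that consumes each field, is one way to answer the trending and business-linkage questions as well.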
I just came back from a Product Information Management (PIM) event this week and had a lot of discussions about how to evaluate vendors and their solutions. I also get a lot of inquiries on vendor selection, and while many of the questions center on the functionality itself, how to evaluate is also a key point of discussion. What piqued my interest on this subject is that IT and the business have very different objectives in selecting a solution for MDM, PIM, and data quality. In fact, it can often get contentious when IT and the business don't agree on the best solution.
General steps to purchase a solution seem pretty consistent: create a short list based on the Forrester Wave and research, conduct an RFI, narrow down to 2-3 vendors for an RFP, make a decision. But, the devil seems to be in the details.
Is a proof of concept required?
How do you make a decision when vendors' solutions appear the same? Are they really the same?
How do you put pricing into context? Is lowest really better?
What do you need to know before engaging with vendors to identify fit and differentiation?
When does meeting business objectives win out over fit in IT skills and platform consistency?
Joining in on the spirit of all the 2013 predictions, it seems we shouldn't leave data quality out of the mix. Data quality may not be as sexy as big data has been this past year. The technology is mature and reliable. The concept is easy to understand. It is also one of the few areas in data management with a recognized and adopted framework for measuring success. (Read Malcolm Chisholm's blog on data quality dimensions.) However, maturity shouldn't breed complacency. Data quality still matters, a lot.
Yet judgment day is here, and data quality is at a crossroads. Its maturity in both technology and practice is steeped in an old way of thinking about and managing data. Data quality technology is firmly seated in the world of data warehousing and ETL. While that world is still a significant portion of the enterprise data management landscape, the adoption of in-memory platforms, Hadoop, data virtualization, streams, and the like in business-critical applications and processes means that more and more data is bypassing the traditional platform.
The options for managing data quality are expanding, but not necessarily in a way that ensures data can be trusted or complies with data policies. Where data quality tools have provided value is in offering a workbench to centrally monitor, create, and manage data quality processes and rules. They created sanity where ETL spaghetti created chaos and uncertainty. Today, that value proposition is diminishing as data virtualization, Hadoop processes, and data appliances create and persist new data quality silos. Worse, these silos often lack the monitoring and measurement needed to govern data. In the end, do we have data quality? Or are we back where we started?
I often see two extremes when I talk to clients trying to deal with data confidence challenges. One group typically sees it as a problem that IT has to address, while business users continue to use spreadsheets and other home-grown apps for BI. At the other extreme, there's a strong, take-no-prisoners, top-down mandate to use only enterprise BI apps. In this case, a CEO may impose a rule that says you can't walk into my office and ask me to make a decision, approve a budget, etc., based on anything other than data coming from an enterprise BI application. This may sound great, but it's often not very practical; the world is not that simple, and there are many shades of grey between these two extremes. No large, global, heterogeneous, multi-business- and product-line enterprise can ever hope to clean up all of its data; it's always a continuous journey. The key is knowing what data sources feed your BI applications and how confident you are about the accuracy of the data coming from each source.
For example, here's one approach that I often see work very well. In this approach, IT assigns a data confidence index (an extra column attached to each transactional record in your data warehouse, data mart, etc.) during ETL processes. It may look something like this:
If data is coming from a system of record, the index = 100%.
If data is coming from nonfinancial systems and it reconciles with your G/L, the index = 100%. If not, it's < 100%.
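The two rules above can be sketched as a simple scoring function applied to each record during ETL. The field names, and the reduced score for unreconciled data, are hypothetical; an actual policy would define the exact values and reconciliation logic.

```python
def confidence_index(record):
    """Assign a data confidence index per the two rules above.

    `source_type` and `reconciles_with_gl` are hypothetical fields that
    a real ETL job would populate upstream of this step.
    """
    if record["source_type"] == "system_of_record":
        return 1.0
    if record.get("reconciles_with_gl"):
        return 1.0
    # Nonfinancial data that fails G/L reconciliation gets a reduced,
    # purely illustrative score (< 100%).
    return 0.8

rows = [
    {"id": "t1", "source_type": "system_of_record"},
    {"id": "t2", "source_type": "nonfinancial", "reconciles_with_gl": True},
    {"id": "t3", "source_type": "nonfinancial", "reconciles_with_gl": False},
]

# Attach the index as an extra column on each transactional record.
for r in rows:
    r["confidence_index"] = confidence_index(r)
```

BI reports can then surface this column alongside the figures themselves, so decision makers see not just the number but how much to trust it.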
The number one reason I hear from IT organizations for why they want to embark on MDM is consolidation or integration of systems. Then the first question I get is: how do they get buy-in from the business to pay for it?
My first reaction is to cringe because the implication is that MDM is a data integration tool and the value is the matching capabilities. While matching is a significant capability, MDM is not about creating a golden record or a single source of truth.
My next reaction is that IT missed the point that the business wants data to support a system of engagement. The value of MDM is to be able to model and render a domain to fit a system of engagement. Until you understand and align to that, your MDM effort will not support the business and you won’t get the funding. If you somehow do get the funding, you won’t be able to appropriately select the MDM tool that is right for the business need, thus wasting time, money, and resources.
Here is why I am not a fan of the "single source of truth" mantra. A person is not one-dimensional; they can be a parent, a friend, or a colleague, and each role has different motivations and requirements depending on the environment. A product is as much about pricing, messaging, and the sales channel it is sold through as it is about its physical aspects. It may also be faceted by the fact that it is assembled from various products and parts from partners. A master entity is neither unique nor consistent; what matters about the entity depends on the situation. What MDM provides are definitions and instructions on the right data to use in the right engagement. Context is a key value of MDM.
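One way to picture "the right data in the right engagement" is a single master record paired with per-context views. All field names and context names below are invented for illustration; a real MDM hub would model this with far richer metadata.

```python
# A single master person record with multiple facets.
master = {
    "person_id": "p-001",
    "legal_name": "Jane Q. Smith",
    "preferred_name": "Jane",
    "home_email": "jane@home.example",
    "work_email": "j.smith@corp.example",
}

# Context definitions: which attributes matter for which engagement.
CONTEXT_VIEWS = {
    "billing":   ["legal_name", "home_email"],
    "marketing": ["preferred_name", "home_email"],
    "b2b_sales": ["legal_name", "work_email"],
}

def render(master, context):
    """Return only the attributes relevant to the given engagement."""
    return {field: master[field] for field in CONTEXT_VIEWS[context]}
```

Here there is no single "truth" about Jane; the billing system sees her legal name, while marketing sees her preferred name. The master record's job is to hold the facets and the rules, not to flatten them into one row.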
There was lots of feedback on the last blog ("Risk Data, Risky Business?") that clearly indicates the divide between definitions of trust and quality. It is a great jumping-off point for the next hot topic, data governance for big data.
The comment I hear most from clients, particularly when discussing big data, is, “Data governance inhibits agility.” Why be hindered by committees and bureaucracy when you want freedom to experiment and discover?
Current thinking: Data governance is freedom from risk. The stakes are high when it comes to data-intensive projects, and having the right alignment between IT and the business is crucial. Data governance has been the gold standard for establishing the right roles, responsibilities, processes, and procedures to deliver trusted, secure data. Success has been achieved through legislative means: enacting policies and procedures that reduce the risk to the business from bad data and bad data management project implementations. Data governance was meant to keep bad things from happening.
Today’s data governance approach is important and certainly has a place in the new world of big data. When data enters the inner sanctum of an organization, management needs to be rigorous.
Yet the challenge is that legislative data governance is by nature focused on risk avoidance. Often this model is still IT-led. This holds progress back: the business may be at the table, but it isn't bought in. This is evidenced by committee- and project-management-style data governance programs focused on ownership, scope, and timelines. All this management and process takes time and stifles experimentation and growth.
So, this blog is dedicated to stepping outside the comfort zone once again and into the world of chaos. Not only may you not want to persist in your data quality transformations, but you may not want to cleanse the data.
Current thinking: Purge poor data from your environment. Put the word "risk" in the same sentence as data quality and watch the hackles go up on data quality professionals. It is like putting salt in your coffee instead of sugar. Yet the biggest challenge I see many data quality professionals face is getting lost in all the data because they need to remove the risk that bad data poses to the business. In the world of big data, you are clearly not going to be able to cleanse all that data. A best practice is to identify the critical data elements that have the most impact on the business and focus efforts there. Problem solved.
Not so fast. Even scoping the data quality effort may not be the right way to go. The time and effort it takes as well as the accessibility of the data may not meet business needs to get information quickly. The business has decided to take the risk, focusing on direction rather than precision.
We last spoke about how to reboot our thinking on master data to provide a more flexible and useful structure when working with big data. In the structured data world, having a model to work from provides comfort. However, there is an element of comfort and control that has to be given up with big data, and that is our definition and the underlying premise for data quality.
Current thinking: Persistence of cleansed data. For years data quality efforts have focused on finding and correcting bad data. We used the word "cleansing" to represent the removal of what we didn't want, exterminating it like an infestation of bugs or rats. Knowing what your data is, what it should look like, and how to transform it into submission defined the data quality handbook. Whole practices were stood up to track data quality issues, establish workflows and teams to clean the data, and produce reports showing what was done. Accomplishment was measured by progress on the number of duplicates, complete records, last updates, conformance to standards, etc. Those reports may even have been tied to our personal goals. Now comes big data: how do we cleanse and tame that beast?
Reboot: Disposability of data quality transformation. The answer to the above question is: maybe you don't. Big data doesn't lend itself to traditional data quality practices. The volume may be too large to process. The volatility and velocity of the data make changes too frequent to manage. The variety of data, in both scale and visibility, is ambiguous.