Data Quality Reboot Series For Big Data: Part 3: Risky Data, Risky Business?

When you last pulled up a chair to this blog, we talked about data quality persistence and disposability for big data. The other side of the coin: should you do big data quality at all?

So, this post is dedicated to stepping outside the comfort zone once again and into the world of chaos. Not only may you not want to persist your data quality transformations, you may not want to cleanse the data at all.

Current thinking: Purge poor data from your environment. Put the word “risk” in the same sentence as data quality and watch the hackles go up on data quality professionals. It is like using salt in your coffee instead of sugar. However, the biggest challenge I see many data quality professionals face is getting lost in all the data because they feel they must remove every risk that bad data poses to the business. In the world of big data, clearly you are not going to be able to cleanse all that data. A best practice is to identify the critical data elements that have the most impact on the business and focus efforts there. Problem solved.

Not so fast. Even scoping the data quality effort may not be the right way to go. The time and effort it takes, as well as the accessibility of the data, may not meet the business need to get information quickly. In that case the business may decide to take the risk, focusing on direction rather than precision.

Reboot: Don’t worry about bad data. Precision is not always the end game, and the business is balancing risk with reward. Understand the decision process. Decisions are based as much on experience and anecdotal evidence as on what the data shows. This trifecta is a balance, and data may be a catalyst or validator, not the only guide. To determine if data cleansing is required, consider the time available, the deviation of analytic results from the perceived or accepted hypothesis, and risk within the context of data use. It may be that data quality really doesn’t matter and the data is good enough.
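To make those three considerations concrete, here is a minimal sketch of how they might be combined into a yes/no call. Everything in it is an assumption for illustration: the `CleansingContext` fields, the 0-to-1 scales for deviation and risk, and the threshold values are hypothetical, and a real decision would involve judgment that no function captures.

```python
from dataclasses import dataclass


@dataclass
class CleansingContext:
    hours_available: float   # time before the business needs an answer
    hours_to_cleanse: float  # estimated effort to cleanse the data
    deviation: float         # 0..1, gap between results and the accepted hypothesis
    risk: float              # 0..1, cost of a wrong decision for this use of the data


def should_cleanse(ctx, deviation_limit=0.2, risk_limit=0.5):
    """Return True when cleansing looks worth the effort (illustrative thresholds)."""
    if ctx.hours_to_cleanse > ctx.hours_available:
        # No time to cleanse: direction wins over precision.
        return False
    surprising = ctx.deviation > deviation_limit
    risky = ctx.risk > risk_limit
    return surprising or risky
```

With these assumed thresholds, a low-risk analysis whose results match expectations is left alone, while a surprising result or a high-stakes use triggers cleansing, provided there is time for it.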

However, don’t throw away your data quality best practices yet. Data quality measures and indexes created for data governance give you guideposts to build a trust continuum for data that helps determine when and when not to put data quality rules and efforts in place. Continuously profile data sources and the quality of data feeding analysis, not just to correct it but to signal when action is necessary.
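As a rough sketch of profiling that informs rather than corrects, the snippet below measures field completeness and maps it onto a three-step scale. The use of completeness as the only measure, the `act`/`watch`/`trust` labels, and the cut-off values are all assumptions made for illustration; they are not taken from the report mentioned below.

```python
def profile(records, required_fields):
    """Return the share of non-empty values for each required field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in required_fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in required_fields
    }


def trust_level(completeness, act_below=0.6, watch_below=0.9):
    """Map the weakest field's completeness onto a simple trust scale."""
    weakest = min(completeness.values())
    if weakest < act_below:
        return "act"    # quality is low enough to justify rules and effort
    if weakest < watch_below:
        return "watch"  # keep profiling, no intervention yet
    return "trust"      # good enough for this use
```

Run on every load, a check like this leaves the data untouched and only reports when a source has drifted far enough to deserve attention.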

Interested in more about the trust continuum? Read Alan Weintraub’s recent report on information governance.

Comments

very interesting point

quality for directional decision making versus quality for precision in decision making
Strategic versus tactical
analytical versus operational
high latency versus low latency

i believe quality routines must be written the way logging mechanisms are written for application servers or databases - they need to follow the same paradigm design wise

we must be able to configure the level and precision of quality based on what data it is (completely trustworthy sources versus social data that has noise in it)
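One way to read the logging analogy, sketched under assumptions: each source gets a configured quality level, much as each logger gets a log level, and each rule declares the minimum level at which it runs. The level names, the source names, and the two rules are hypothetical.

```python
from enum import IntEnum


class QualityLevel(IntEnum):
    OFF = 0     # pass data through untouched
    BASIC = 1   # cheap checks only
    STRICT = 2  # full rule set


# Hypothetical per-source configuration, analogous to per-logger levels.
SOURCE_LEVELS = {"crm": QualityLevel.OFF, "social": QualityLevel.STRICT}

# Each rule declares the minimum level at which it runs.
RULES = [
    (QualityLevel.BASIC, "non_empty", lambda text: bool(text.strip())),
    (QualityLevel.STRICT, "no_urls", lambda text: "http" not in text),
]


def failed_rules(source, text, default=QualityLevel.BASIC):
    """Return the names of the rules that the text fails at the source's level."""
    level = SOURCE_LEVELS.get(source, default)
    return [
        name
        for min_level, name, check in RULES
        if level >= min_level and not check(text)
    ]
```

Here a trusted source set to `OFF` skips every check, while a noisy social feed set to `STRICT` runs them all, and changing that is a configuration edit rather than a code change.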

good breakdown - nice link to process

To your point, configuration is always the sticking point. Quality rules in one use case can be different for another. If you have data at the warehouse or application, are global routines causing challenges for "local" types of needs?

It is funny sometimes what we define as noise. For some it is junk. For others it is treasure. Goes back to operational and analytics in some ways.