So, this blog is dedicated to stepping outside the comfort zone once again and into the world of chaos. Not only may you not want to persist in your data quality transformations, but you may not want to cleanse the data.
Current thinking: Purge poor data from your environment. Put the word “risk” in the same sentence as data quality and watch the hackles go up on data quality professionals. It is like using salt in your coffee instead of sugar. However, the biggest challenge I see many data quality professionals face is getting lost in all the data due to the fact that they need to remove risk to the business caused by bad data. In the world of big data, clearly you are not going to be able to cleanse all that data. A best practice is to identify critical data elements that have the most impact on the business and focus efforts there. Problem solved.
Not so fast. Even scoping the data quality effort may not be the right way to go. The time and effort it takes as well as the accessibility of the data may not meet business needs to get information quickly. The business has decided to take the risk, focusing on direction rather than precision.
We last spoke about how to reboot our thinking on master data to provide a more flexible and useful structure when working with big data. In the structured data world, having a model to work from provides comfort. However, there is an element of comfort and control that has to be given up with big data, and that is our definition and the underlying premise for data quality.
Current thinking: Persistence of cleansed data.For years data quality efforts have focused on finding and correcting bad data. We used the word “cleansing” to represent the removal of what we didn’t want, exterminating it like it was an infestation of bugs or rats. Knowing what your data is, what it should look like, and how to transform it into submission defined the data quality handbook. Whole practices were stood up to track data quality issues, establish workflows and teams to clean the data, and then reports were produced to show what was done. Accomplishment was the progress and maintenance of the number of duplicates, complete records, last update, conformance to standards, etc. Our reports may also be tied to our personal goals. Now comes big data — how do we cleanse and tame that beast?
Reboot: Disposability of data quality transformation. The answer to the above question is, maybe you don’t. The nature of big data doesn’t allow itself to traditional data quality practices. The volume may be too large for processing. The volatility and velocity of data change too frequently to manage. The variety of data, both in scale and visibility, is ambiguous.