Data Quality Reboot Series For Big Data: Part 2 Persistence Vs. Disposable Quality

We last spoke about how to reboot our thinking on master data to provide a more flexible and useful structure when working with big data. In the structured data world, having a model to work from provides comfort. With big data, however, some of that comfort and control has to be given up, starting with our definition of data quality and the premise underlying it.

Current thinking: Persistence of cleansed data. For years, data quality efforts have focused on finding and correcting bad data. We used the word “cleansing” to represent the removal of what we didn’t want, exterminating it like an infestation of bugs or rats. Knowing what your data is, what it should look like, and how to transform it into submission defined the data quality handbook. Whole practices were stood up to track data quality issues and to establish workflows and teams to clean the data, and reports were then produced to show what was done. Accomplishment was measured by progress on, and maintenance of, metrics such as the number of duplicates, complete records, last update, and conformance to standards. Those reports may also be tied to our personal goals. Now comes big data: how do we cleanse and tame that beast?

Reboot: Disposability of data quality transformation. The answer to the above question is, maybe you don’t. The nature of big data doesn’t lend itself to traditional data quality practices. The volume may be too large to process. The data is too volatile and changes too quickly to manage. The variety of data, both in scale and visibility, is ambiguous.

Your data quality efforts need to be defined more as profiling and standards than as cleansing. This is better aligned with how big data is managed and processed. Because big data processing is, on the surface, batch in nature, it would seem obvious to institute data quality rules the way they have always been done. But the answer is to be more service-oriented, invoking data quality rules that provide improved standardization and sourcing during processing rather than fundamentally changing the data. In addition, data quality rules are invoked in a customized fashion, through service calls made from big data processing.
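To make the service-oriented idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a reference design: the `QualityService` class, the two rule functions, and the tiny alias table are all hypothetical names invented for this example. The point it tries to show is that a processing job asks for only the rules it needs, gets standardized values back alongside the raw record, and the sourced data itself is never changed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical rule type: takes a raw value, returns a standardized value.
Rule = Callable[[str], str]


def standardize_country(value: str) -> str:
    """Map a few free-text spellings to one code (illustrative table only)."""
    aliases = {"usa": "US", "u.s.": "US", "united states": "US", "uk": "GB"}
    cleaned = value.strip().lower()
    return aliases.get(cleaned, value.strip().upper())


def standardize_phone(value: str) -> str:
    """Keep digits only; a stand-in for a real phone number standard."""
    return "".join(ch for ch in value if ch.isdigit())


@dataclass(frozen=True)
class QualityResult:
    raw: Dict[str, str]           # the record exactly as it was sourced
    standardized: Dict[str, str]  # disposable values derived at processing time


class QualityService:
    """Hypothetical rule service that a processing job calls per record."""

    def __init__(self, rules: Dict[str, Rule]):
        self._rules = rules

    def apply(self, record: Dict[str, str], fields: List[str]) -> QualityResult:
        """Invoke only the rules the caller asks for; never mutate the input."""
        derived = {
            field: self._rules[field](record[field])
            for field in fields
            if field in self._rules and field in record
        }
        return QualityResult(raw=record, standardized=derived)


service = QualityService({"country": standardize_country, "phone": standardize_phone})
record = {"name": "Acme", "country": " usa ", "phone": "(555) 010-9999"}

# This particular job only cares about country, so only that rule runs.
result = service.apply(record, fields=["country"])
print(result.standardized)
print(record["country"] == " usa ")  # the sourced value is left as it arrived
```

A different job could call `service.apply(record, fields=["phone"])` against the same raw record and get a different standardized view, which is what makes the transformation disposable rather than persistent.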

This also makes sense because, when you do decide to persist sourced big data into your internal infrastructure, you have already aligned the data to existing integration policies and to the business rules for the mapping and cleansing that does need to persist. In essence, you treat big data as a reference source, not a primary source. When have you looked to persist your data quality rules on reference data from a third party?

So, think about data quality in the context of supporting preprocessing with Hadoop and MapR through profiling and standards, not cleansing.
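As a rough illustration of what profiling against a standard can look like, the Python sketch below measures a field without correcting anything. The function name `profile_field`, the chosen metrics, and the two-letter country pattern are assumptions made for this example. It is also written for a small in-memory sample; on a real big data set the same counts would be computed inside the processing job, and the distinct count would likely need to be approximated.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Pattern


def profile_field(
    records: Iterable[Dict[str, str]], field: str, standard: Pattern[str]
) -> Dict[str, float]:
    """Measure completeness and conformance of one field in a single pass."""
    total = 0
    present = 0
    conforming = 0
    values: Counter = Counter()
    for rec in records:
        total += 1
        value = rec.get(field)
        if value is None or value == "":
            continue  # missing values count against completeness only
        present += 1
        values[value] += 1
        if standard.fullmatch(value):
            conforming += 1
    return {
        "records": total,
        "completeness": present / total if total else 0.0,
        "conformance": conforming / present if present else 0.0,
        "distinct": len(values),
    }


# Assumed standard for illustration: two uppercase letters.
COUNTRY_CODE = re.compile(r"[A-Z]{2}")

sample = [{"country": "US"}, {"country": "usa"}, {"country": ""}, {"country": "GB"}]
report = profile_field(sample, "country", COUNTRY_CODE)
print(report)
```

The report tells you how far the source is from your standard, which is the information you need to decide whether a standardization rule is worth invoking at all.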

Up Next:

Reboot: Data quality and acceptable risk

Comments

Nice thought!

In other words, stop thinking of ETL processes as the only way to assure back end data quality.
One thing to think about is the use of virtualization technology to produce cleansed virtual views of data that reach back to raw big data sets.
Another related thought is the notion of "just in time quality". When you are operating on raw big data sets, I think we should be creating "just in time" quality services capable of doing just enough transformation and cleansing to get to the desired outcome.
Pig script is a great Hadoop-enabled big data transformation language, and I regularly hear about firms using it to cut big data transformation operations by an order of magnitude or more.