3 Ways Data Preparation Tools Help You Get Ahead Of Big Data

The business has an insatiable appetite for data and insights. Even in the age of big data, the number one issue for business stakeholders and analysts is getting access to the data. Once access is achieved, the next step is "wrangling" the data into a usable data set for analysis. The term "wrangling" itself creates a nervous twitch, unless you enjoy the rodeo. But the goal of the business isn't to be an adrenaline junkie. The goal is to get insight that helps it smartly navigate increasingly complex business landscapes and customer interactions. Those who get this have introduced a softer term, "blending," which is another term dreamed up by data vendor marketers to avoid the dreaded conversation about data integration and data governance.

The reality is that you can't market-message your way out of the fundamental problem: big data is creating data swamps even in the best-intentioned efforts. (This is the consequence of big data's first principle of schema-less data.) Data governance for big data is primarily relegated to cataloging data and its lineage, which serves the data management team but creates a new kind of nightmare for analysts and data scientists: working with a card catalog that will rival the Library of Congress. Dropping in a self-service business intelligence tool or advanced analytic solution doesn't solve the problem of familiarizing the analyst with the data. Analysts will still spend up to 80% of their time just trying to create the data set from which to draw insights.

Companies like Paxata saw this problem and set out to eliminate it, not with a backend data integration and data management approach, but with a front-office data preparation tool that connects subject matter experts intimately with their data.  The point of data preparation tools is three-fold:

  1. Embracing schema on read defined by the business, not IT. Big data creates big exploration and makes enterprise data models obsolete. IT can't anticipate, define, and build data models that keep pace with the infinite queries, analytic iterations, and business changes that affect the creation of data sets for analysis. Schemas have to be created by the business and analysts and connected to what they want to achieve with the data. Data preparation tools enable this by using machine learning and artificial intelligence to define schemas across aggregated data sources and by providing a spreadsheet-like environment where data professionals can quickly and easily refine the data for the intended use.
  2. Data stewardship becomes part of doing business. The idea that data scientists would have to submit to data governance activities was a big data killer out of the gate. In fact, when you look at data governance for Hadoop today, the most mature aspect is security, not the quality and consistency of the data. However, that isn't to say that quality and consistency don't matter. Analysts, and data scientists in particular, work out data bugs as they prepare data sets. Data preparation tools recognized the data citizenship already occurring and delivered a better platform that further empowers data stewardship actions while aligning with how analysts think about and interact with the data. This keeps data aligned with the semantics of business language and nomenclature, not data systems.
  3. Transparency and collaboration catapult big data operational systems. If data sets are built to create real-time fraud detection systems, next best action for customer engagement, or optimization of manufacturing processes on plant floors, data preparation can't happen in a vacuum. The old way of migrating from analytics to operations was a lengthy process in which business subject matter experts and analysts sat through long interview sessions with IT business analysts and enterprise architects to define requirements. Data preparation tools cut this process down by capturing data preparation steps that IT can take and translate into a production environment. Even as analysts and business stakeholders optimize analytic models and include additional data, IT still has access to these changes and can more easily adapt systems to keep pace.
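
To make the first and third points concrete, here is a minimal sketch in Python. It is not how Paxata or any other vendor implements these ideas. It only illustrates the two concepts under simple assumptions: the schema is inferred from the data at read time instead of being declared up front, and each preparation step is recorded so that someone else (IT, for example) can list and replay it. The names infer_schema and Recipe, and the sample records, are hypothetical and chosen for illustration.

```python
from collections import Counter


def infer_type(value):
    """Guess a column type from one raw value (illustrative heuristic only)."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except (ValueError, TypeError):
            pass
    return "str"


def infer_schema(rows):
    """Schema on read: derive column types from the data itself."""
    votes = {}
    for row in rows:
        for column, value in row.items():
            votes.setdefault(column, Counter())[infer_type(value)] += 1
    # Keep the most frequently observed type for each column.
    return {column: counts.most_common(1)[0][0] for column, counts in votes.items()}


class Recipe:
    """Hypothetical record of preparation steps that can be listed and replayed."""

    def __init__(self):
        self.steps = []  # (description, function) pairs, in the order added

    def add(self, description, func):
        self.steps.append((description, func))
        return self  # allow chaining

    def run(self, rows):
        # Apply every step to a copy of each row; the input is left untouched.
        for _, func in self.steps:
            rows = [func(dict(row)) for row in rows]
        return rows

    def describe(self):
        return [description for description, _ in self.steps]


# Made-up sample records, as they might arrive from a raw source.
raw = [
    {"customer": " Acme ", "amount": "120.50"},
    {"customer": "acme", "amount": "80.00"},
]

recipe = (
    Recipe()
    .add("trim and upper-case customer",
         lambda r: {**r, "customer": r["customer"].strip().upper()})
    .add("cast amount to float",
         lambda r: {**r, "amount": float(r["amount"])})
)

schema = infer_schema(raw)
prepared = recipe.run(raw)

print(schema)
print(recipe.describe())
print(prepared)
```

The list returned by describe() stands in for the captured steps an analyst could hand to IT, and run() stands in for replaying those same steps against production data.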

It may be sexier to think of data preparation tools as big data analytic solutions. Yet that would miss the full value and relevance these tools have in the bigger picture: getting control of, and competency with, big data for more than data science activities and limited operational implementations. Data preparation tools are the catalyst for bringing trust, speed, and actionable insight to all data where traditional data governance and management tools have hit the wall.

Check out Forrester's report on data preparation tools and find out how three data professional roles will be transformed by data preparation tools.



Michele, very informative article. With the explosion of big data, companies are faced with data challenges in three different areas. First, you know the type of results you want from your data, but they are computationally difficult to obtain. Second, you know the questions to ask but struggle with the answers and need to do data mining to help find them. Third is data exploration, where you need to reveal the unknowns and look through the data for patterns and hidden relationships. The open source HPCC Systems big data processing platform can help companies with these challenges by making it quick and simple to derive insights from massive data sets. Designed by data scientists, it is a complete integrated solution from data ingestion and data processing to data delivery. Its built-in Machine Learning Library and Matrix processing algorithms can assist with business intelligence and predictive analytics. More at http://hpccsystems.com

Very well said!

Very well said and informative content shared. Data management is a crucial foundation of your professional work. The data you collect and analyze are a national resource. Stewardship means taking responsibility for a set of data for the well-being of the larger organization, and operating in service to, rather than in control of, those around us. Data stewardship is primarily the job of the professionals who create and maintain data. Although they have significant support roles to play, stewardship cannot simply be delegated to the IT or GIS shops. More about big data at https://intellipaat.com/

Michele, thanks for an insightful analysis. Another dimension here is the effect of data preparation (ingestion, cleansing, updating, integrating, etc.) on the underlying infrastructure. Companies should monitor and tune data placement to minimize bottlenecks. They can consider investing in software that monitors workloads, and use it for example to identify the tables, columns, etc. that are most heavily used and therefore might be appropriate for Hadoop rather than an EDW. They also can invest in software that streamlines data moves across platforms, eliminating CPU bottlenecks, by automating previously manual coding requirements, using change data capture for subsequent updates.
- Kevin Petrie, Attunity

free data preparation tool

Very useful information. Though I feel data preparation, feature engineering, and data wrangling can all be combined into one superset: data quality. Data quality is not only about data cleaning.

this project is bringing all those into one