Pondering the topic of this blog, I became intrigued by the notion that the greater data variety brought on by Big Data will trigger the next wave of data quality innovation. Organizations will need to determine how they intend to garner “insights of interest” from Big Data amid what they consider to be noise in their expanding data universe. So let me tell you a related story to explain my thought process.
In the not-so-distant past (1997 to be exact), second-generation ETL tools such as DataStage propelled the ETL tools market into the data integration solutions we know today; back then, companies were eager just to Extract, Transform and Load their data in a timeframe that mattered. Quickly the race was on to add innovative ways to pull enterprise data from mainframe systems, enterprise applications, message brokers and newfangled formats such as XML files and web logs. The goal was to create a consolidated view of heterogeneous data in a data warehouse that supported business analytics.
However, the growing data volumes that resulted from this thirst for data forced a move to parallel processing-based ETL architectures. Also, as familiarity with what the data meant to the business grew, logical errors were found in the data itself. The data was sometimes ambiguous, inaccurate or redundant. It was difficult to say whether data in one silo was truly comparable to (or the same as) data from another silo (e.g., across multiple insurance policies or banking accounts). So the data wasn’t so reliable after all; it needed to be cleansed – standardized, matched and deduplicated – if the CFO’s reports were going to be right.
Thanks to innovation and industry consolidation, we moved from separate data quality and ETL tools and processes to single, integrated ETL and data quality platforms. This helped to streamline and scale the data quality function, making it routine to obtain enterprise data that was trusted and comparable across silos. The secret sauce was the ability to apply data quality disciplines in a tireless and predictable manner to the data flowing around and into an organization, such as bills of material, patient, vendor and supplier data, address information, survey results, and generics information; the makeup of the data was endlessly varied. Although much of the demand for data quality started with postal rebates, product information management, householding, geocoding, eliminating redundant data, or simply linking disparate data from multiple systems, the trick was speed, accuracy and dealing with large data volumes.
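To make the standardize–match–deduplicate discipline concrete, here is a minimal sketch of that pipeline. The record fields, abbreviation table, and match-key logic are all hypothetical simplifications; commercial platforms use probabilistic and fuzzy matching rather than an exact key like this.

```python
import re

# Hypothetical customer records with inconsistent formatting (illustrative only).
records = [
    {"name": "John Q. Smith", "street": "123 Main St."},
    {"name": "john smith",    "street": "123 Main Street"},
    {"name": "Jane Doe",      "street": "45 Oak Ave"},
]

# Tiny abbreviation table; real standardization dictionaries are far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def standardize(value: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def match_key(record: dict) -> tuple:
    """Build a simple match key from first/last name and standardized street."""
    name = standardize(record["name"]).split()
    # Using first and last tokens lets a middle initial still match.
    return (name[0], name[-1], standardize(record["street"]))

def deduplicate(records: list) -> list:
    """Keep the first record seen for each match key."""
    seen, unique = set(), []
    for rec in records:
        key = match_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

print(len(deduplicate(records)))  # → 2: the two John Smith variants collapse
```

The point of the sketch is the order of operations: standardization makes records comparable, matching decides which records refer to the same entity, and only then does deduplication remove the survivors' twins.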
What we learned was that data quality is contextual:
1) the definition of trusted data depends on the purpose for which the business applies the data,
2) the value of what the data represents to the business, and
3) the rigor you need to apply to the data to ensure it can be trusted.
Now back to the present day and the recently announced IBM platform for Big Data that dovetails the two worlds of Enterprise Data and Big Data within an Information Supply Chain to support an organization’s information strategy. BTW, I’m over-simplifying here to make my point: bringing Big Data (of greater Volume and Velocity) into enterprise systems such as a data warehouse through enterprise integration requires that you have comparable data. For example, reconstructing seemingly dissimilar pieces of data about a consumer’s online garden furniture purchase with their prior retail purchase patterns, their tweets and Facebook posts about their selection process (and friends’ opinions), and their customer service experience is a fairly straightforward challenge whether the data is at rest or in motion, albeit on a larger scale. Plus, there is a common thread in the data: the specific information related to an individual’s credit card and purchase history, information about them from a data service provider, their email, Twitter and Facebook IDs.
In my opinion, dealing with a greater Variety of data is a more challenging data quality dimension to consider. It raises the question, “What constitutes accuracy, reliability, consistency and comparability of Big Data with data types such as audio, video or sensor data?” You could say signal processing is a data quality technique of sorts, used to identify and act upon a precise data signature extracted from the noise of a continuous flow of analog data. Alternatively, identifying a specific person, place, action or object in a vast quantity of text that is misspelled 19 different ways (in multiple languages) is like looking for a very small needle in a huge haystack.
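One common way to find those 19 misspellings is fuzzy matching on edit distance; the sketch below uses the classic Levenshtein dynamic-programming algorithm. The token list and the distance threshold of 2 are illustrative assumptions, not a production tuning.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def find_variants(target: str, corpus: list, max_dist: int = 2) -> list:
    """Return tokens within max_dist edits of the target name."""
    return [w for w in corpus if edit_distance(target.lower(), w.lower()) <= max_dist]

# Hypothetical corpus: three misspellings of one surname plus an unrelated word.
tokens = ["Shwarzeneger", "Schwartzenegger", "London", "Schwarzneger"]
print(find_variants("Schwarzenegger", tokens))  # matches the three misspellings
```

Even this naive approach shows why Variety is hard: edit distance scales poorly across a vast corpus, says nothing about transliteration between languages, and a fixed threshold trades false matches against missed ones.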
As I’ve explained, leveraging Big Data within the enterprise requires data quality techniques to be used as part of the enterprise information integration process. However, I believe there will be another wave of innovation in the area of data quality as organizations push the boundaries of Big Data in their pursuit of “insights of interest” and competitive advantage from the rapidly expanding data universe and unusual data sources in a timeframe that matters.
To compete effectively in today’s changing world, it is essential that companies leverage innovative technology to differentiate themselves from competitors. Learn how you can do that and more in the Smarter Computing Analyst Paper from Hurwitz and Associates.