When does 00100011 change from data to #bigdata?
Unless you’ve chosen to live your life completely off the grid, Ted Kaczynski-style (the fact that you’re reading this blog entry negates that possibility!), someone, somewhere, is using big data technology to try to understand more about you.
It may be a retailer looking to give you a better customer experience (my colleague Renato Stoffalette Joao provides some good use cases); a campaign staff using what it knows about you and millions of your fellow citizens to craft a winning election strategy (check out the Obama campaign’s brilliant use of big data); or a government sifting through your records in the interest of public safety (as with the US National Security Agency’s recently disclosed PRISM program). I have little doubt that big data will drive unprecedented change in our lives, whether we like that change or not.
I just wish that someone had come up with a different name, because the term big data carries the implicit meaning that everything else is, well, something other than big (small data? little data?) and, consequently, less important.
In the ASCII character set (which UTF-8 and Unicode preserve unchanged), 00100011 is the binary representation of the hash symbol (#). The bit sequence 00100011 represents # no matter what technology you use to generate it, store it and process it. However, under some circumstances this string of bits is considered to be big data and under other circumstances it is not. There’s nothing in the encoding that makes the distinction, so just when does 00100011 change from just data to big data?
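The encoding claim is easy to verify for yourself; this quick Python check confirms that the byte 0b00100011 decodes to # the same way everywhere:

```python
# 0b00100011 is 0x23 (decimal 35), which maps to '#' in ASCII,
# and UTF-8 encodes ASCII characters as the identical single byte.
bits = 0b00100011

print(bits)        # 35
print(chr(bits))   # '#'

# Encoding '#' as UTF-8 yields exactly that one byte back.
assert "#".encode("utf-8") == bytes([bits])
```

The same holds in any language and on any platform, which is precisely the point: the bits never change, only the context around them does.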
Google “What is big data?” and you will be presented with over two million explanations to ponder. Most definitions—including IBM’s—tend to define big data as a class of data that exhibits particular characteristics in the areas of volume, velocity, variety and veracity—“the four Vs”:
- Volume. So much data is being collected that we’re now starting to project the worldwide accumulation of digital data in terms of zettabytes.
- Velocity. Data is being produced so quickly that we tend to think of it in terms of continuous streams as opposed to repositories of discrete events.
- Variety. Data is being recorded in a wide range of structured and unstructured formats such as transactions, sensor feeds, Tweets, email and video.
- Veracity. Not all of this data can be fully trusted for any number of reasons (sensor precision, unverified sources, ambiguous text and so on).
Why does this matter? Isn’t a # just a #?
Traditional data processing technologies—querying a DBMS with SQL, for example—are considered to have limitations in the volume, velocity and variety of data that they can efficiently process. And they are designed to work with trusted data. When these limits are exceeded, different technologies may be required. The Hadoop MapReduce system for processing large data sets using parallel technologies often comes to mind when one thinks of big data.
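To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases in Python; the tweet strings are invented for illustration, and real Hadoop would distribute each phase across a cluster of machines rather than run them in one loop:

```python
from collections import defaultdict

# Hypothetical input: a tiny batch of tweets to count hashtags in.
tweets = [
    "#bigdata is everywhere",
    "is #bigdata just data?",
    "#data matters",
]

# Map phase: emit a (word, 1) pair for every word in every record.
mapped = [(word, 1) for line in tweets for word in line.split()]

# Shuffle phase: group all values under their key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values to a single count.
counts = {key: sum(values) for key, values in groups.items()}

print(counts["#bigdata"])  # 2
```

The pattern's power comes from the fact that the map and reduce steps are independent per key, so they can be spread across thousands of nodes; that is what lets the same # byte be processed at volumes a single SQL query could not handle.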
But what, exactly, is the breaking point at which this technological shift occurs?
Unfortunately, there are no firm guidelines as to how big, how fast, how varied or how truthful data must be before it is classified as big data. I have a client who currently maintains records on 2.5 billion transactions in an active, online, fully accessible DB2 z/OS database. Another just put in place the ability to perform accelerated ad hoc querying and reporting against over a petabyte of mainframe warehouse capacity. A credit card processor manages a sustained rate of 750 authorizations per second with a 250-millisecond response time on their mainframe. I have many more such examples; by most measures, there is some serious big data being processed on the mainframe every day. But it’s not being processed by what are considered to be big data technologies.
Value – the fifth dimension?
When characterizing data, I think that we need to explore one more “V” to complete the picture: value. If you wanted to know more about me and my habits, and you were able to hack my security credentials and gain access to my computer, you’d tap into around 120 GB of digital images (yes, I take a lot of pictures!), a trove of data that is a clear candidate for big data analysis using parallel technologies.
You would also find an 80 MB file containing my Quicken personal finance data. This relatively small file of structured data contains records of all of my financial transactions dating back to 1995 (yes, I am also a very conservative record keeper!). I can’t think of anyone who would classify this as big data. Yet, you would learn far, far more about me by analyzing my Quicken data than you would by analyzing my pictures. The Quicken file—my core financial data, or “book of record”—has greater analysis value than my big data photographic files.
Making 00100011 work for you, whether it’s big or not
Fortunately, you’re not faced with an either-or decision when it comes to big data technology. I recommend to my mainframe clients that they first understand the questions they want answered, then identify the data that will be most valuable in answering those questions. Analytics processing should be positioned as close to that data as possible. It should come as no surprise that this high-value data is nearly always hosted on the mainframe. If you haven’t already done so, please check out my MythBuster blogs, Episode I and Episode II, for ideas on how to deploy analytics into your existing mainframe environment.
Once your core analysis plan is set, look for other sources of data that can augment and enhance your results. This will very often include data that is best processed with MapReduce and stream technologies. In future blog posts I’ll discuss how to condense and connect big data insights into mainstream analytics processing.
Returning to my personal example, an analysis of my financial data would reveal that I travel a lot, and would provide the tangible details of every trip. But you wouldn’t know anything about the quality of my travel. Correlating an analysis of my photographs to my financial records would give a more complete view of what I did while traveling, and a further analysis of my Facebook and Twitter feeds would let you know how I felt about the experience. This core analysis of my finances, augmented by a “big data” analysis of ancillary data, would provide a complete picture of my travel habits and position you to make offers that I would find appealing.
As long as the standards don’t change, the bit stream 00100011 will always represent the # symbol. Whether # signifies a big data Twitter hashtag or a not-big data customer purchase order, learning how to match the characteristics of the data to the characteristics of your systems is key to successfully turning that data into actionable insights.
Share your thoughts below or connect with me on Twitter.
Paul DiMarzio has 30+ years experience with IBM focused on bringing new and emerging technologies to the mainframe. He is currently responsible for developing and executing IBM’s worldwide z Systems big data and analytics portfolio marketing strategy. You can reach Paul on Twitter: @PaulD360.