Big data technologies like MapReduce and Stream Processing will completely change the way we think about computing. For the last 40 years, since the relational database changed the world, we have settled for an approach of “extend and contain” centered on the relational model. We “extend” by developing new technology to make the relational model applicable to a wider range of problems – examples include partitioning strategies to scale, or solid state disks to improve performance. We also “contain” by scoping problems to fit the technology that exists.
Until now, “extend and contain” was the best, and the only feasible, approach for addressing data management challenges. Today, technology innovations and market forces are leading us to new big data approaches for problems with high volume and velocity data. These approaches will enable us to collectively raise our expectations about the kinds of applications we deploy in our enterprises. We will expect to see and understand more of our data, act on data immediately as it enters the enterprise, and take advantage of new data to gain competitive advantage. We will architect our processes to apply the right tool for the job, whether it be relational, streaming, MapReduce, or new approaches yet to be invented.
To understand the transformative potential of big data, it is useful to look back at the Industrial Revolution. The Industrial Revolution was an incredible period of innovation spanning nearly 200 years from the mid 1700s to the early 1900s. The innovations of the Industrial Revolution impacted almost every sector of society—transportation, communication, farming and manufacturing. Many of these innovations were incremental—perhaps a machine to replicate the way a human performed a single task or an improvement to an existing machine. But many others, like the development of steam power and later electricity, had a wider impact. These technologies built on each other and made it possible to periodically achieve major leaps in progress.
Perhaps one of the most significant innovations in manufacturing during the Industrial Revolution was the use of interchangeable parts and assembly lines that has been famously associated with Henry Ford in the mass production of automobiles. The idea of continuous, assembly line manufacturing had been tried before with some success, but Ford used it at just the right time when machines had reached a state of sophistication, accuracy and precision that made the approach practical, and when the demand for automobiles drove the need for much more efficient manufacturing methods.
Assembly lines completely transformed the manufacturing process. One example was the notion that focusing the right skills on the right jobs would lead to the production of more “widgets” per unit time with the same resources. For illustration, consider the manufacturing of cars. There may be one mechanic who installs the frame and body, another who installs the wheels, another who installs the transmission, and a final mechanic who installs the engine. Assembly lines and interchangeable parts allowed each mechanic to focus on his or her area of expertise and eliminated the overhead associated with specialization. Rather than being built one at a time, as was done before, cars were moved through a logical sequence of operations so that each mechanic was always working on some car at some step in the sequence.
This sequencing and continuous assembly means that if each of the four major operations takes one hour, then once the line is full a car rolls off every hour, rather than every four hours as it did when each car was completed before work began on the next. Assembly lines also bring the right parts and people together to perform the next step for each car, further reducing overhead for the workers and improving efficiency. This approach also had the effect of promoting optimization and automation of each step in the process. In the end, the development of the assembly line changed more than just the automobile industry. From candy bars to cars, it changed the way virtually everything is manufactured.
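The arithmetic behind the one-car-per-hour figure can be checked with a short sketch. The function names below are illustrative, and the model assumes every step takes exactly the same time and the line never stalls.

```python
def sequential_finish_times(num_cars, num_steps, hours_per_step):
    """Finish time of each car when one car is fully built before the next starts."""
    return [(i + 1) * num_steps * hours_per_step for i in range(num_cars)]


def assembly_line_finish_times(num_cars, num_steps, hours_per_step):
    """Finish time of each car when every step works on a different car at once."""
    first_car = num_steps * hours_per_step  # the first car still needs every step
    return [first_car + i * hours_per_step for i in range(num_cars)]


# Four cars, four one-hour steps, as in the example above.
print(sequential_finish_times(4, 4, 1))     # [4, 8, 12, 16]
print(assembly_line_finish_times(4, 4, 1))  # [4, 5, 6, 7]
```

The first car takes four hours either way. The gain is in throughput: after the line fills, the gap between finished cars drops from four hours to one.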
It is important to note that assembly line technology was not universally accepted or appreciated when it was first used. Improved efficiency meant more production, but it also meant fewer workers were required for the same level of production. Furthermore, automation and specialization meant that less skilled workers could be employed for the same tasks, some jobs were completely eliminated, and others were scaled back or merited lower pay. There were significant costs to setting up an assembly line, making sure the process was right, and making adjustments as requirements evolved. Workers, and anyone whose business was built on techniques that preceded the assembly line, were reluctant to change because it was so disruptive. Yet, for all its disadvantages, people soon recognized that more efficiency meant accelerated production and greater profits. This shift also led to the creation of higher-value jobs; as the complexity of what could be built and achieved increased, more engineers and scientists were needed to invent and optimize new technologies.
Now consider the Information Revolution. We have had major advances in areas such as storage, memory, multi-core processors, networking, computer languages, clusters, compiler technology and development environments. We have hit major plateaus of progress with technology like relational databases, web application servers and social media. As in the Industrial Revolution, when the need for mass production drove new manufacturing methods, the volume of data we face today is driving us to abandon “extend and contain” approaches and to build more efficient methods. All prior innovations of the Information Revolution, coupled with the need to address greater volumes and varieties of data, have led us to create an “assembly line for data.”
An “assembly line for data” has many similarities to the assembly line for manufacturing. The end product of manufacturing, whether a car or a candy bar, is more valuable, whether sold or used, than the sum of its parts. Similarly, an “assembly line for data” can synthesize insight that is more valuable, whether sold or used, than the sum of its parts. More importantly, an “assembly line for data” is a more efficient method for processing data as it enters the enterprise. Rather than storing data and periodically processing it in a batch, data can be processed as it arrives. Just like the assembly line for manufacturing, data can be passed through a pipeline or assembly line of operations, bringing the right data and operations together at the right steps in the process.
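To make the idea concrete, here is a minimal sketch of a pipeline that handles each record as it arrives instead of storing records for a later batch. It uses plain Python generators, not InfoSphere Streams; the stage names, the quote format and the reference prices are all invented for illustration.

```python
def parse(lines):
    """Stage 1: turn raw 'SYMBOL,price' text into a record."""
    for line in lines:
        symbol, price = line.split(",")
        yield {"symbol": symbol, "price": float(price)}


def enrich(records, reference_prices):
    """Stage 2: bring reference data together with each record."""
    for record in records:
        change = record["price"] - reference_prices[record["symbol"]]
        yield {**record, "change": round(change, 2)}


def alerts(records, threshold):
    """Stage 3: keep only the moves large enough to act on."""
    for record in records:
        if abs(record["change"]) >= threshold:
            yield f"{record['symbol']} moved {record['change']:+.2f}"


def arriving_quotes():
    """Stand-in for a live feed (hypothetical data)."""
    yield from ["IBM,182.50", "ACME,10.10", "IBM,185.75"]


reference = {"IBM": 182.00, "ACME": 10.00}
pipeline = alerts(enrich(parse(arriving_quotes()), reference), threshold=1.0)
for message in pipeline:
    print(message)  # prints: IBM moved +3.75
```

Each quote flows through all three stages before the next one is read, so nothing has to be held in storage waiting for a batch job.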
Processing in a pipeline means that the operations can be scaled out to more servers or cores because each core can focus on its own operation in the sequence. Like an assembly line, steps that take longer can be further broken up into substeps and parallelized. Because the data is processed in an assembly line, much higher throughputs can be achieved, and because it is processed as it arrives rather than later in batch, initial results are available much sooner and the amount of “inventory” or enrichment data that needs to be kept on hand is decreased.
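The structure described above, one worker per operation with items handed from step to step, might be sketched as follows. This is an assumption-laden illustration rather than how any particular product works: it uses Python threads and in-memory queues, and because of Python's global interpreter lock it shows the shape of the approach more than real multi-core speedup. `run_stage`, `run_pipeline` and `DONE` are names made up for this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def run_stage(operation, inbox, outbox):
    """One worker: apply a single operation to each item as it arrives."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # tell the next stage the stream has ended
            return
        outbox.put(operation(item))


def run_pipeline(operations, items):
    """Connect one worker per operation with queues and collect the output."""
    queues = [queue.Queue() for _ in range(len(operations) + 1)]
    workers = [
        threading.Thread(target=run_stage, args=(op, queues[i], queues[i + 1]))
        for i, op in enumerate(operations)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results


print(run_pipeline([str.strip, str.upper, len], ["  ibm ", " acme"]))  # [3, 4]
```

While the last stage is measuring the first item, the earlier stages can already be working on the second. A slow stage could be split into two operations, each with its own worker, in the same way a long manufacturing step is broken into substeps.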
An “assembly line for data” has several benefits not found in manufacturing. First, “assembly lines for data” can be modified nearly instantaneously by upgrading operators in the sequence or by completely changing the sequence of operations. This quick change means that the process can be continuously optimized or customized based on changing data, changing resources, or changing needs of the organization. Additionally, because all data can be reduced to binary sequences, the base assembly line technology can be the same and dynamically adjusted based on the size and weight of the data. The same data conveyor can be used for stock quotes and blood pressure data, whereas candy bars and cars typically use different types of conveyors suited for their relative sizes and weights because it is not practical to make dynamic adjustments.
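Both points can be illustrated with a small sketch: one generic "conveyor" that carries unrelated kinds of data, and an operator that is swapped without rebuilding the line. The `run` function and the sample values are hypothetical and stand in for a real stream processing runtime.

```python
def run(operators, stream):
    """Pass each item through the current operators, in order."""
    for item in stream:
        for operate in operators:
            item = operate(item)
        yield item


# One conveyor, two unrelated kinds of data.
quotes = [1.25, 2.5]                # stock prices in dollars
pressures = [(120, 80), (135, 85)]  # (systolic, diastolic) readings

to_cents = [lambda price: round(price * 100)]
pulse_pressure = [lambda reading: reading[0] - reading[1]]

print(list(run(to_cents, quotes)))           # [125, 250]
print(list(run(pulse_pressure, pressures)))  # [40, 50]

# Upgrading one operator changes the line without rebuilding it.
clean = [str.strip, str.lower]
print(list(run(clean, [" IBM "])))  # ['ibm']
clean[1] = str.upper
print(list(run(clean, [" IBM "])))  # ['IBM']
```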
As part of our big data platform, IBM has developed stream processing technology to address the emerging requirement for more timely results coupled with greater volumes and varieties of data. Our IBM InfoSphere Streams product is the “assembly line for data” of the Information Revolution, and it was made possible because of all of the technological advancements during the Information Revolution that created the conditions necessary for it to exist. The continuous pipeline processing approach is not new to computing—in fact it has been used in microprocessor design already with very good results. What is new in stream processing is the higher level of abstraction and making it available at cluster scale.
InfoSphere Streams also includes many elemental building blocks such as text analytics, data mining and relational analytics, to make it easier for enterprises to build “assembly lines” for business insight. Just like the physical assembly line technology, there will initially be some inhibitors to adoption because the technology is so disruptive. However, like the physical assembly line, we believe the approach is so much better for certain business challenges that it will be embraced and used widely. Enterprises are just starting to realize that an “assembly line for data” is possible – with a few more high profile successes like Ford’s physical assembly line, it will transform the industry.
The question every CIO should ask is whether the enterprise is getting everything out of its data that it should be, and whether the organization is able to extract insight from its data as fast as it should be.
Imagine the business impact of being able to analyze more data than before, faster than before. What if it suddenly became cost effective to analyze data that was too time sensitive or expensive to analyze before? How much better could our customer relationships be? Would we be able to consistently identify new market opportunities before competitors? Would we be agile enough to respond to changes in the market as they are happening?
There is tremendous potential in applying the same automation to data that we have applied in the physical world. Do you want to lead the revolution?