In my previous post, I discussed what big data means and how it should be implemented within an enterprise. This week, I would like to go back to the basics and talk a little about data itself. Before we can understand what Big Data is and how best to process it, we should understand what data is. A good way to understand the current state of data is to look at how it has evolved over time. This, in turn, gives us insight into how data has been, is, and will be stored, processed, and consumed.
In the beginning…
The first recorded instance of digital data originates from the textile mills of the early industrial revolution. Around 1725, two Frenchmen, Basile Bouchon and Jean-Baptiste Falcon, used punched cards to control textile looms. This was further improved upon by another Frenchman - Joseph Marie Jacquard (1801) - who developed the Jacquard loom, which ran on punched cards that stored the patterns to be woven. A Russian - Semen Korsakov (1832) - then adapted the punched card for information storage and search, and generously offered his findings and machines for public use. Charles Babbage also adopted the punched card, in the form of "Number cards", in the design of his Analytical Engine (1837). Later, Herman Hollerith created the Hollerith card, which was used in the 1890 US census. The punched card worked on the concept of binary data: a hole indicated a 1, while a blank space indicated a 0. The use of punched cards as the primary mode of data storage continued well into the twentieth century, until the invention of the Williams tube (1946) - an early form of Random Access Memory (RAM) - and the introduction of magnetic tape storage in the early 1950s, which arrived alongside the first commercially available computers (the Ferranti Mark I in 1951).
Magnetic tape storage allowed computers (now electronic rather than mechanical) to store larger volumes of data in a more persistent and accessible form. This resulted in the first model of digital data storage - the file. Files were primarily used to group volumes of related text data (ASCII, once it was standardized in 1963), which was stored on magnetic tape using various encoding techniques. As the use of computers became mainstream, a small explosion (in today's terms) of data storage occurred from the 1950s into the 1970s. However, because magnetic tape storage was expensive and required expensive hardware to read and write files, only important data was stored. The files themselves were primarily containers for classifying data; they did not add any other meaningful intelligence to the data that was recorded. As magnetic tapes became more advanced, multiple files could be stored on the same tape. This eventually led to the second important concept of data modelling - the file system. The file system was a logical method of organizing and classifying the information stored in files, so that recorded data could be retrieved more quickly than it could be from the files alone. At this point in time, the raison d'être of digital data was primarily archival.
The need for intelligence
As organizations and commercial enterprises collected an increasing amount of data, the need for faster data retrieval became very important. Since the file system by itself could not help identify recorded data quickly, much time and effort was invested in the creation of more logical methods of data storage and retrieval. This ultimately led to the creation of the third model of data storage - the database.
By definition, a database is a mode of data storage in which data is stored according to an entity-relationship model: each entity is the data itself (or the logical container for that data, such as a record or node), stored along with its relationships to other entities. These models allowed more intelligent storage of, and improved navigational access to, the data. The creation of the database brought with it the increased use of data for business intelligence. Three primary database models arose at this time - Network, Hierarchical, and Relational. Over time, the relational model became the de facto database model. The need for more reliable and failsafe data that could be queried much more efficiently led to the creation of the Relational DataBase Management System (RDBMS) and the birth of Structured Query Language (SQL). Multiple vendors - including pioneer Oracle - advanced the RDBMS over the next couple of decades to support real-time transaction processing and advanced querying.
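To make the relational idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an RDBMS. The customer and purchase tables, their columns, and the sample rows are all hypothetical and exist only for illustration. The point is that the relationship between the two entities is stored explicitly (purchase.customer_id refers to customer.id), so a single SQL query can follow it.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two related (hypothetical) entities."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- Each purchase points back at the customer who made it.
        CREATE TABLE purchase (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            amount      REAL NOT NULL
        );
        INSERT INTO customer (id, name) VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO purchase (customer_id, amount)
            VALUES (1, 20.0), (1, 5.5), (2, 12.0);
        """
    )
    return conn


def total_spent_per_customer(conn):
    """Follow the customer -> purchase relationship with a SQL join."""
    return conn.execute(
        """
        SELECT c.name, SUM(p.amount)
        FROM customer AS c
        JOIN purchase AS p ON p.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
        """
    ).fetchall()


if __name__ == "__main__":
    for name, total in total_spent_per_customer(build_example_db()):
        print(name, total)
```

With a plain file, answering "how much has each customer spent?" would mean reading and parsing every record by hand; here the database does the navigation from the declared relationship.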
The transition from file systems to the database also transformed the type of data that was collected and stored. Until the advent of the database, data was stored in an unstructured manner. There was no indication of the kind of data that was collected - other than perhaps some descriptive data about the contents themselves. The database enforced the need for data structure. Not only was it important to compartmentalize the data itself, it was also important to characterize the type of data collected (numbers, text, date times, etc.). Although file systems were still in use at this point for archiving unstructured data, the majority of the data that was collected and used was now structured data.
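As a rough illustration of what "enforcing structure" means, the sketch below turns a free-text line, as it might sit in a file, into a record whose fields have declared types. The SaleRecord type, its field names, and the comma-separated input format are assumptions made up for this example; a real database would enforce this through its schema rather than in application code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SaleRecord:
    """A hypothetical structured record: every field has a declared type."""
    customer: str
    quantity: int
    sold_on: date


def parse_sale(line: str) -> SaleRecord:
    """Convert one comma-separated text line into a typed record.

    Raises ValueError when a field does not match its declared type,
    which is roughly what a database does when it rejects a bad row.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    customer, quantity, sold_on = parts
    return SaleRecord(customer, int(quantity), date.fromisoformat(sold_on))


if __name__ == "__main__":
    print(parse_sale("Acme Corp, 12, 1985-03-14"))
```

A line such as "Acme Corp, twelve, 1985-03-14" is perfectly acceptable as text in a file, but it is rejected here because the quantity is not a number.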
Despite relatively high storage and operating costs, many organizations through the 1980s and the early 1990s were happy to spend large amounts of money to acquire, ingest, store, and process data using the RDBMS. This was primarily because the use of an RDBMS allowed datafication - the ability to discern previously unseen patterns and relationships in the data. As organizations started to gain more intelligence from the data they had stored, data became a critical asset within organizations. Two processes for managing data were developed - extract-transform-load (ETL) and extract-load-transform (ELT). The collection of larger volumes of data, together with the limitations of databases, led to the creation of two more advanced data models - the Enterprise Data Warehouse and Object Oriented Databases.
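The difference between the two processes is mostly a question of where the transformation happens. The sketch below is a toy comparison, again using Python's sqlite3 module in place of a real warehouse; the RAW_ROWS sample data and the sales and raw_sales table names are hypothetical. In the first function the data is cleaned in application code before it is loaded; in the second the raw data is loaded first and cleaned inside the database with SQL.

```python
import sqlite3

# Hypothetical raw extract: messy names, amounts still stored as text.
RAW_ROWS = [("  ada ", "20.0"), ("GRACE", "12.5"), ("ada", "5.5")]


def etl(raw_rows):
    """Extract -> transform in application code -> load the clean rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in raw_rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return conn


def elt(raw_rows):
    """Extract -> load the raw rows as-is -> transform inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
    conn.execute(
        """
        CREATE TABLE sales AS
        SELECT lower(trim(customer)) AS customer,
               CAST(amount AS REAL)  AS amount
        FROM raw_sales
        """
    )
    return conn


def totals(conn):
    """Total sales per customer from the cleaned sales table."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(totals(etl(RAW_ROWS)))
    print(totals(elt(RAW_ROWS)))
```

Both routes end with the same cleaned table. The practical trade-off is that the second keeps the untouched raw data available in the database, at the cost of storing and processing more there.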
Although Object Oriented Databases were capable of handling large amounts of complex data, the widespread use of relational databases led to the quick adoption of the Enterprise Data Warehouse. The Enterprise Data Warehouse was capable of handling much larger volumes of relational data than a traditional RDBMS, and it also offered organizations the ability to transform and view the data they required in a number of ways that were not previously possible with traditional databases.
In my next post, I will cover the explosive growth of semi-structured and unstructured data, which eventually led to the birth of Big Data.
Please do let me know your views in the comments section below.