The term 'big data' is one of the most misunderstood terms in the IT world today. As the popular saying goes: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, and so everyone claims to be doing it. There is perhaps a good reason for this. A quick internet search will turn up at least five different definitions of the term, each with its own set of "Vs" that attempt to characterize big data. Worse still is the deluge of articles that either extol the need for organizations to get into big data ASAP or dismiss big data as a hyped-up term that will provide no value to any organization. In my experience, the answer lies somewhere between the two extremes. In this post I would like to define the term big data and outline the steps that lead to a successful big data project.
So what is big data? The term has evolved to the point that it is indistinguishable from the hardware, software, tools, and processes used to process data. I feel that the original definition, built around the three Vs of volume, variety, and velocity, is still the best one. Big data is essentially data that cannot be processed using traditional RDBMS technologies alone: any data that is too voluminous (including structured data like relational tables), too varied (existing in multiple formats), or collected too quickly to be cleaned and loaded into a traditional data storage system. Whatever its shape, for the data to provide value to its consumers, it needs to be correct, complete, and timely.
However, the most important thing to understand about big data is that it needs a problem to solve before it can be collected, examined, processed, and consumed. Ultimately, it is the problem statement that leads to the creation of a data model, which in turn drives the exploration and processing that unlock the true value for enterprises. Big data is only as useful as the data model used to draw inferences from it. Unfortunately, most people who are disappointed with big data are disappointed because they tried to process and consume data without first codifying what they were hoping to answer, solve, or decide.
I believe that all successful big data projects share the following processes:
- Define: A lucid problem statement that data can help answer is created (data should be used only to enable the decision maker to make decisions, not vice versa)
- Identify: Domain experts figure out what data needs to be collected (source, type, amount, etc.), where it should be collected from (logs, SAP, etc.), and how it should be collected (the technical bit)
- Model: A scalable data model that makes sense of all the collected information is built (the exploration phase). It can take a while to find a model that makes sense for the sample taken, and the sample should then be expanded to see if the model still holds. Many organizations run out of patience here and end up choosing a model that does not scale well (a minimal sketch of this validation loop follows the list)
- Implement: The technology stack needed to industrialize the aggregation, ingestion, processing, and consumption of data is identified and implemented (a toy sketch of this flow also appears below)
- Optimize: The data model and the implementation processes are continuously refined
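To make the Model step concrete, here is a minimal sketch in Python (NumPy only, purely for illustration) of what "expand the sample and see if the model still holds" can look like. The data, the linear model, and the `draw_sample` helper are hypothetical stand-ins for whatever your own exploration phase produces:

```python
# A minimal sketch of the "Model" step: fit a candidate model on a small
# sample, then re-check it on progressively larger samples before committing.
# All data here is synthetic; in a real project the rows would come from the
# sources chosen in the "Identify" step.
import numpy as np

rng = np.random.default_rng(seed=42)

def draw_sample(n):
    """Stand-in for pulling n rows from the real data source."""
    x = rng.uniform(0, 10, n)
    y = 3.0 * x + rng.normal(0, 1.0, n)  # hidden relationship plus noise
    return x, y

# Explore: fit a simple linear model on a small sample.
x_small, y_small = draw_sample(200)
slope, intercept = np.polyfit(x_small, y_small, deg=1)

# Validate: expand the sample and check that the model still holds,
# instead of stopping at the first sample that looks good.
for n in (1_000, 10_000, 100_000):
    x, y = draw_sample(n)
    residuals = y - (slope * x + intercept)
    rmse = np.sqrt(np.mean(residuals ** 2))
    print(f"n={n:>7}: RMSE = {rmse:.3f}")
# If the error grows as the sample grows, the model that "made sense" on
# the small sample does not generalize, and it is back to exploration.
```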
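And here is a deliberately tiny sketch of the Implement step's data flow. A real stack would be distributed (a message queue feeding a processing engine, for instance); the `aggregate`, `ingest`, and `process` stages below are hypothetical names that simply show how the four responsibilities chain together:

```python
# A toy sketch of the "Implement" step's data flow: aggregation, ingestion,
# processing, and consumption wired together as generator stages. Stage
# names and record formats are illustrative only.
from typing import Iterable, Iterator

def aggregate(sources: Iterable[Iterable[str]]) -> Iterator[str]:
    """Merge raw records from several sources into one stream."""
    for source in sources:
        yield from source

def ingest(records: Iterable[str]) -> Iterator[dict]:
    """Parse raw records into structured form, dropping malformed ones."""
    for record in records:
        parts = record.split(",")
        if len(parts) == 2:  # correctness check before the data moves on
            yield {"user": parts[0], "value": int(parts[1])}

def process(events: Iterable[dict]) -> dict:
    """Reduce the event stream to the aggregate the problem statement needs."""
    totals: dict = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0) + event["value"]
    return totals

# Consumption: the decision maker sees a summary, not raw data.
logs = ["alice,3", "bob,5", "malformed-line", "alice,2"]
crm_export = ["bob,1"]
print(process(ingest(aggregate([logs, crm_export]))))
# {'alice': 5, 'bob': 6}
```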
What do you think about this? Please leave your comments in the section below.
My next post will revolve around how big data is processed by various organizations, the issues faced by organizations implementing big data, and the need for big data assurance as a practice on all big data projects.