Data modeling as a trade has been practiced in the IT world for many decades. As a discipline, data modeling is the process of arriving at a diagram by exploring the data in question and developing a deep understanding of it. Representing the data pictorially helps business and technology experts understand both the data and how it will be used. In addition, the relationships within the data sets are pre-defined, determining in advance what the data should look like. However, with the arrival of NoSQL and Big Data, there is neither the time nor the means to represent data the way we did before. Moreover, there is also the question of how and when a data model for big data will be required and how it will be used.
Data variety and representational challenges:
In the past, during data modeling, we organized the data the way we intended to use it and then deployed it to the database stores. With today's unstructured data, defining the context and relating the data pieces happens at a different stage. In other words, the data is only loosely coupled to the way it will be used, and its representation takes on a different context depending on who uses it. For example, machine data is viewed in one dimension by the marketing team and in another by the engineering team. Building a single-dimensional representation of the data is therefore a challenge with unstructured and NoSQL data. This is where the traditional approach of modeling the data first and storing it later breaks down.
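To make the point concrete, here is a minimal schema-on-read sketch in Python. The records, field names, and the two "views" are hypothetical; the point is only that the same raw machine data is shaped differently at read time depending on whether engineering or marketing consumes it.

```python
import json

# Hypothetical raw machine-data records: no schema is imposed at write time.
raw_records = [
    '{"device": "sensor-42", "ts": "2016-03-01T10:00:00Z", "temp_c": 71.3, "page": "/pricing", "user": "u123"}',
    '{"device": "sensor-42", "ts": "2016-03-01T10:05:00Z", "temp_c": 72.1, "page": "/signup",  "user": "u456"}',
]

def engineering_view(record):
    """Engineering cares about device health: device id, timestamp, temperature."""
    doc = json.loads(record)
    return {"device": doc["device"], "ts": doc["ts"], "temp_c": doc["temp_c"]}

def marketing_view(record):
    """Marketing cares about behaviour: which user touched which page, and when."""
    doc = json.loads(record)
    return {"user": doc["user"], "page": doc["page"], "ts": doc["ts"]}

# The "model" is applied only at read time, and it differs per consumer.
print([engineering_view(r) for r in raw_records])
print([marketing_view(r) for r in raw_records])
```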
Are new data models buried in the application code?
How we read and interpret data from MongoDB or from a Hadoop file, or the procedure a data scientist follows when writing an R program to get answers from a data lake, may define the new data model. In other words, built-to-use platforms such as Teradata Aster provide home-grown algorithms for building pattern-matching data models that are defined and used dynamically, on the fly. Binding data on the fly and using it matters only if it answers the question of interest. So do we need to document and diagram those hidden algorithms buried in complex map-reduce programs? That remains an open question for now.
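As a small illustration of how a data model can stay buried in code, here is a toy map-and-reduce sketch in plain Python. It is not tied to Hadoop or any particular framework, and the log format and field positions are invented for the example. The only place the structure of the data is ever stated is inside the map function, which is exactly the kind of "hidden model" the question above points at.

```python
from collections import defaultdict

# Hypothetical raw log lines as they might sit in a Hadoop file or data lake;
# nothing about their structure is declared anywhere outside this program.
log_lines = [
    "2016-03-01 10:00:01 u123 view /pricing",
    "2016-03-01 10:00:05 u123 click /signup",
    "2016-03-01 10:01:12 u456 view /pricing",
]

def map_phase(line):
    """The implicit 'data model' lives here: token 3 is a user, token 4 an action, token 5 a page."""
    _, _, user, action, page = line.split()
    yield (user, (action, page))

def reduce_phase(pairs):
    """Group events per user -- a relationship that exists only inside this code."""
    sessions = defaultdict(list)
    for user, event in pairs:
        sessions[user].append(event)
    return dict(sessions)

mapped = [pair for line in log_lines for pair in map_phase(line)]
print(reduce_phase(mapped))
```

If the parsing in map_phase changes, the effective data model changes with it, with no diagram or catalog entry anywhere to record the fact.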