In my earlier post, I covered the basics of ensuring a successful big data project. Many firms have already invested considerable money and effort into getting their big data projects up and running. Although some are successful, most current implementations face numerous data and project related issues. Data issues primarily revolve around the quality and value of the data, while project related issues revolve around the absence of a performance and assurance strategy, along with the difficulty of finding the right skills in the market for the tasks that need to be performed.
At a more granular level, current projects often suffer from data issues (quality, correctness, timeliness, etc.) detected late in the processing cycle, poor productivity due to a lack of automation, the lack of a performance testing strategy, and the inability to perform routine tasks (such as full set data validation on large datasets). Therefore, I believe the need of the hour on current big data projects is a holistic strategy to address these issues.
Big Data Assurance – What does it mean?
Big data assurance is about providing a strategy, deriving a process, and aligning the right tools and resources to address the problem areas outlined above. However, in order to create a big data assurance strategy, it is not only important to understand the pain points observed in current implementations, it is also critical to understand the nature of big data implementations seen across enterprises. From my experience in the world of big data so far, I have observed that there are currently two primary flavors of enterprise implementations:
- High Volume, Velocity, and Variety Data Repositories, and
- High Performance Data Processing Engines.
Each flavor of implementation requires a different assurance strategy to address the issues that will be faced. For big data platforms used as data repositories (such as data lakes), the primary area of concern is the correctness, completeness, and timeliness of the data stored from various sources. To ensure this, two primary tasks need to be performed: first, ensure that each data source provides data that is correct and complete compared to the source system, and second, ensure that the quality of the data on the platform meets the governance and timeliness policies required. A minimal sketch of what the first task could look like in practice follows below.
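As an illustration, the sketch below shows one way a source-to-target reconciliation could be automated with PySpark. It is only an example under assumptions I have made for the purpose of this post: the feed paths, column names, and the fingerprinting approach are hypothetical, not a prescription for any particular platform.

```python
# A minimal sketch of source-to-target reconciliation for a data lake feed.
# The file paths and column names below are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feed-reconciliation").getOrCreate()

source_df = spark.read.option("header", True).csv("/landing/customers.csv")
target_df = spark.read.parquet("/lake/raw/customers")

# Completeness: every source row should have landed in the lake.
source_count = source_df.count()
target_count = target_df.count()
assert source_count == target_count, f"Row count mismatch: {source_count} vs {target_count}"

# Correctness: compare a lightweight, order-independent fingerprint of key columns.
def fingerprint(df, cols):
    # Hash each row's key columns, then aggregate the hashes so the
    # comparison does not depend on row order.
    return (df.select(F.sha2(F.concat_ws("|", *cols), 256).alias("h"))
              .agg(F.sum(F.crc32(F.col("h"))).alias("fp"))
              .collect()[0]["fp"])

key_cols = ["customer_id", "email", "created_at"]
assert fingerprint(source_df, key_cols) == fingerprint(target_df, key_cols), \
    "Content fingerprint mismatch between source and lake copy"
```

The point is less the specific checks than the fact that they run the same way on every load, rather than being sampled by hand.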
Assuring quality on high performance Hadoop platforms requires a slightly different approach, in addition to the tasks associated with the first type of implementation. Here it is not only critical to ensure the quality and correctness of the data stored on the system; it is also imperative to test, both functionally and non-functionally (for example, for performance), the various algorithms written to cleanse, process, and transform the data that ultimately produces the metrics, dashboards, reports, and other consumables required. A small sketch of such a functional test follows below.
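For this second flavor, a functional test around a single transformation is often the first building block. The sketch below assumes a hypothetical cleansing function (normalising emails and de-duplicating customers) and verifies it with pytest on a local Spark session; the transform, sample data, and expectations are illustrative only.

```python
# A minimal sketch of a functional test for a hypothetical cleansing transform.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def cleanse_customers(df):
    """Hypothetical transform under test: trim/lowercase emails, drop duplicate ids."""
    return (df.withColumn("email", F.lower(F.trim(F.col("email"))))
              .dropDuplicates(["customer_id"]))


@pytest.fixture(scope="module")
def spark():
    # Local Spark session is enough for functional tests of transformation logic.
    return SparkSession.builder.master("local[2]").appName("transform-tests").getOrCreate()


def test_cleanse_normalises_and_deduplicates(spark):
    raw = spark.createDataFrame(
        [(1, " Alice@Example.COM "), (1, "alice@example.com"), (2, "bob@example.com")],
        ["customer_id", "email"],
    )
    result = cleanse_customers(raw).orderBy("customer_id").collect()
    assert [r["email"] for r in result] == ["alice@example.com", "bob@example.com"]
```

Non-functional checks (for example, running the same transform against a representative volume of data and asserting on elapsed time) can then be layered on top of the same harness.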
Moreover, given that assurance tasks in big data implementations involve working with large and varied amounts of data, it is imperative to have an automation strategy so that resources do not spend large amounts of time and effort performing mundane yet critical tasks. Despite this, most implementations today perform nearly all testing tasks in the big data world manually, which means a large amount of time and labour is spent on assurance tasks that should be automated, much as they have been in the application quality assurance world. A sketch of what such automation could look like follows below.
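As a final illustration, the sketch below drives a set of routine checks (row counts and null checks) from a simple declarative list of feeds, so the same validations run on every load without manual effort. The feed names, paths, and rules are hypothetical, and the rules themselves would normally come from the governance policies mentioned earlier.

```python
# A minimal sketch of automating routine data quality checks across feeds,
# so they are not repeated by hand on every load. Feeds and rules are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("routine-dq-checks").getOrCreate()

# One declarative entry per feed, instead of a manual checklist per run.
FEEDS = [
    {"name": "customers", "path": "/lake/raw/customers", "not_null": ["customer_id", "email"]},
    {"name": "orders",    "path": "/lake/raw/orders",    "not_null": ["order_id", "customer_id"]},
]

failures = []
for feed in FEEDS:
    df = spark.read.parquet(feed["path"])
    if df.count() == 0:
        failures.append(f"{feed['name']}: no rows loaded")
    for col in feed["not_null"]:
        nulls = df.filter(F.col(col).isNull()).count()
        if nulls:
            failures.append(f"{feed['name']}: {nulls} null values in {col}")

# Fail the pipeline run if any routine check breaks, rather than discovering it downstream.
if failures:
    raise SystemExit("\n".join(failures))
```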
What do you believe is the need of the hour on your big data implementations? I look forward to hearing any divergent views on the topic as well. If you would like to understand or discuss the contents further, feel free to drop me a mail or leave a comment below this post.