For many companies, the use of Big Data platforms, applications and analytics has shifted from the periphery to an increasingly mission-critical role in business operations. Clustered data storage systems that host these vast amounts of data are often seen as offering high data availability right out of the box. In reality, uninformed use of these systems can lead to data unavailability, defeating the very goal of high availability that such systems advertise.
In recent years, big data systems have proliferated as a result of the plummeting cost of storage, particularly cloud storage, coupled with the greater availability of computing resources. Recognizing the potential to harness insights and enable critical decision-making from their data, businesses across many industries are investing in big data systems and technologies. Consequently, the rise of big data, and in particular unstructured big data, has added many layers of complexity to the essential activities that comprise data management.
Perception versus Reality
While businesses and data managers might perceive their data to be secure and highly available due to replication in the storage cluster, data on such “high availability” systems is still susceptible to accidental loss, failure, or corruption.
As big data is increasingly put to operational use, the potential for inadvertent data loss through system failures or human error grows. Human errors can be especially damaging, as mistakes can propagate rapidly throughout a system before they are noticed. Multiple replicas of the same dataset can alleviate some problems but also give rise to new issues: production applications accessing the system can slow down, and storage costs rise significantly, as multiple replicas may require investing in and managing petabytes of storage capacity.
One widely held belief regarding new-age big data systems is that the aggressive, built-in replication protects data from any potential loss. In reality, any change made to data, whether good or bad, is instantaneously replicated across the cluster. An error or corruption in the data is therefore replicated throughout the cluster just as instantly, leaving no good copy to work with. It is thus critically important to have point-in-time backups from which a recovery can be made. New-age capabilities, like snapshot technology, allow for instantaneous point-in-time images of an entire dataset. This removes the need for a coordinated backup operation during a period of low activity and allows backup processes to run while a system is up and operational.
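The snapshot-and-restore idea above can be illustrated with a toy sketch. The class below is purely hypothetical and only demonstrates the concept: real snapshot implementations (in HDFS, ZFS, and similar systems) use copy-on-write metadata rather than full copies, precisely so that snapshots are instantaneous and cheap.

```python
import copy

class SnapshotStore:
    """Toy key-value store illustrating point-in-time snapshots.

    A minimal sketch of the concept only; production systems snapshot
    via copy-on-write metadata, not deep copies of the data.
    """

    def __init__(self):
        self._data = {}
        self._snapshots = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def snapshot(self, name):
        # Capture a point-in-time image of the entire dataset.
        self._snapshots[name] = copy.deepcopy(self._data)

    def restore(self, name):
        # Roll the live dataset back to a known-good state.
        self._data = copy.deepcopy(self._snapshots[name])


store = SnapshotStore()
store.put("record", "good value")
store.snapshot("nightly")         # point-in-time backup, no downtime

store.put("record", "corrupted")  # a bad change replicates everywhere...
store.restore("nightly")          # ...but the snapshot is unaffected
print(store.get("record"))        # good value
```

The key property is that the snapshot is decoupled from the live, replicated dataset: corruption that propagates across every replica still cannot touch the point-in-time image.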
Another instance of data being less available than expected occurs when developers need access to data containing sensitive personal information. For security reasons, developers rarely get immediate access: they are forced to wait while the data is masked or encrypted, and only then is it released to them. This delay means the data is, by definition, no longer "highly available".
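The masking step described above can be sketched in a few lines. The function below is a hypothetical illustration, not a production approach: real masking pipelines typically use tokenization services or format-preserving encryption, whereas this sketch simply replaces sensitive values with a truncated salted hash so records remain joinable without exposing the original values.

```python
import hashlib

def mask_record(record, sensitive_fields, salt="demo-salt"):
    """Return a copy of `record` with sensitive fields pseudonymized.

    Hypothetical sketch: each sensitive value is replaced by a salted
    SHA-256 digest, so the same input always maps to the same token
    (joinable) but the original value is not recoverable by inspection.
    """
    masked = dict(record)
    for field in sensitive_fields:
        if field in masked:
            digest = hashlib.sha256(
                (salt + str(masked[field])).encode("utf-8")
            ).hexdigest()
            masked[field] = digest[:12]  # short token in place of the value
    return masked

user = {"id": 42, "email": "jane@example.com", "purchase_total": 99.5}
safe = mask_record(user, ["email"])   # email replaced, other fields intact
```

Automating this kind of transformation is exactly what shortens the wait between a developer's request and the release of usable, de-identified data.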
Ad Hoc Solutions Fall Short
Given the high cost of data loss, many organizations facing these challenges have developed in-house solutions. Engineers are typically tasked with creating, maintaining and editing custom scripts that ensure data is secure and available. While developers might want data available to them at all times, company security policies, as previously mentioned, can dictate lengthy processes for protecting sensitive information. To ensure that data is properly archived and stored, some large businesses are simply footing a large bill for the enormous storage space required for multiple replicas of their data.
These ad hoc solutions can work, but they are far from perfect and, more importantly, out of reach for businesses without the budget or engineering talent. Storing petabytes is not only an enormous cost but can also negatively impact production workloads and network performance. Dedicating engineering talent just to writing scripts that maintain data quality is a waste of human capital and a significant drain on resources for small and medium businesses.
In a structured data environment, data lifecycle management activities and their challenges are well understood. In contrast, unstructured data environments are still being studied and tested, and organizations often lack the specialized talent required to deal with these problems efficiently. One reason is that unstructured big data use cases are still emerging, so approaches and solutions are still being developed. A closely related reason is that experienced engineers typically focus on either applications or infrastructure, and someone with a good understanding of both areas is rare.
Innovative companies have recognized the necessary convergence of Applications and Infrastructure, and are building products that are "Application-aware". Furthermore, big data management-as-a-service is a specialized and evolving segment of Software-as-a-Service. Just as businesses have outsourced networking and storage to Cloud Service Providers, they have begun to utilize dedicated service providers to manage their stores of big data. Utilizing big data management-as-a-service enables:
- Better Disaster Recovery procedures and policies
- More efficient Data Archival
- Easier Test/Dev data management
- Streamlined and more extensive Data Masking/Encryption
- Improved Data Migration processes
The benefits of deploying enterprise-grade Data Lifecycle Management solutions are many, with the primary effect being that businesses can focus on their core value drivers and operations while ensuring high availability.