A data lake is a single repository of data in its original, raw format for the entire enterprise. The data in the data lake serves as a source for many purposes across the organization, such as reporting, analytics, compliance, visualization and machine learning.
A data lake includes different types of data – both text and binary – and data in various formats such as log, CSV, JSON, XML and Parquet. Data is usually loaded into the data lake "as is", without any transformation. It is stored in various mediums such as file systems, RDBMS, NoSQL databases, graph databases, time series databases, blockchain ledger databases, in-memory cache databases and analytical databases. Any transformation of the as-is data is done as per the needs of the consumption side, downstream of the data storage, changing the paradigm from ETL (Extract, Transform and Load) to ELT (Extract, Load and Transform).
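The ELT shift can be illustrated with a minimal Python sketch. This is not AWS code: a plain list stands in for raw storage such as an S3 bucket, and the record fields (`user`, `amount`, `extra`) are invented for illustration. The point is that the load step preserves the record untouched, and shaping happens only when a consumer asks for it.

```python
import json

# ELT sketch: raw records are loaded "as is" (no transformation on write);
# any shaping happens later, at consumption time.

raw_zone = []  # stands in for raw storage such as an S3 bucket (assumption)

def load(record_line: str) -> None:
    """Load step: append the raw line untouched, preserving the original."""
    raw_zone.append(record_line)

def transform_for_report(lines):
    """Transform step, run only when a consumer needs it: parse the raw
    JSON and keep just the fields this particular report uses."""
    for line in lines:
        rec = json.loads(line)
        yield {"user": rec["user"], "amount": float(rec["amount"])}

# Load two raw events exactly as they arrived.
load('{"user": "alice", "amount": "42.50", "extra": "kept-in-raw"}')
load('{"user": "bob", "amount": "7.00", "extra": "kept-in-raw"}')

# Transform only for this consumer; the raw zone stays untouched, so a
# future consumer can still use the "extra" field.
report = list(transform_for_report(raw_zone))
```

Because the raw zone is never rewritten, a later tool with different needs can re-derive its own view from the same untouched records.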
A data lake is important because:
- The raw data from the entire enterprise is easily available for any analytical need – current or future.
- When newer and more powerful tools become available, the data lake allows the discovery of new insights from existing data.
- The easy availability and large scale allow a lot of experimentation.
- Since the data is in its original, raw format, it is not contaminated; it is therefore possible to look back, fine-tune earlier findings and develop new insights from historical data.
- Transforming strictly for consumption ensures an appropriate cost-benefit analysis at every step. Investment beyond experimentation is made only if there is a business need, and no unnecessary large-scale development or operation takes place if the value is not apparent from day one.
Why on AWS?
Enterprises are increasingly moving from creating data lakes on premises to creating them in the public cloud (particularly on AWS).
Some of the reasons are:
- Drastically lower cost: Businesses pay only for the period/capacity they actually use; there is no need to pay for idle capacity. Also, the massive scale of AWS makes equivalent resources available at a much lower cost than the small-quantity purchases each individual business would otherwise make.
- No upfront capital investment: Payment is on a periodic schedule based on actual usage during the period; high upfront capital investments are not required. This, along with the lower cost, makes possible many initiatives that were hitherto unthinkable in traditional on-premises development.
- Drastically less infrastructure management: Companies no longer need to make huge capital and manpower investments in infrastructure management; Amazon does this for them. Since AWS has the benefit of scale, it can ensure better service too.
- Faster time to develop and market: The required hardware and software tools are just a click away in the cloud, freeing the team from a time- and effort-consuming procurement, build and certification process with long lead times. This results in a quick turnaround time for any development.
- Best tools for the available budget: In the traditional scenario, much emphasis is placed on IT tool rationalization because of the high upfront capital requirement and the cost already sunk in tooling. This results in force-fitting the architecture into an existing tool set that may not be the most technically suitable for new requirements. With a huge variety of the latest tools available on a pay-per-use basis, without capital lock-in, there is no longer any need to compromise on sub-optimal components.
- More experimentation, less guesswork, quick pivots: The quick availability of a large variety of tools at low pay-per-use cost allows a high number of proof-of-concept experiments. Dead ends can be identified and discarded early, and more focus can be given to the promising paths forward. This reduces the guesswork in business decision making when betting significant corporate resources on future development.
- Higher availability and quicker disaster recovery: Since Amazon can afford to maintain multiple data centers in each region thanks to its massive scale, customers automatically get the benefit of geographical spread. AWS also offers and encourages low-cost, low-maintenance replication of resources and horizontal scaling. This results in higher availability, a more resilient architecture and quicker disaster recovery.
- Faster scalability: The scale of available hardware and software in a public cloud, combined with quick deployment, allows an application to roll out from experimental scale to web scale in just a few clicks. The offering can scale both vertically and horizontally, as such options are readily available on AWS on a cost-effective, pay-per-use basis.
- More geographic coverage: Unlike on-premises data centers, AWS offers a global footprint, and assets can be located near the intended users, giving lower latency. Also, some regulatory compliance requires data to be pinned to a particular region, and the global footprint easily allows this.
- Latest feature/functionality updates for the underlying platform: Due to the benefit of scale, AWS introduces the latest features and functionality in the underlying platform quickly, after thorough testing for stability. This is not easily possible on premises.
- Better security for data storage and access: Again, due to its larger scale, AWS can invest far more in security hardening than individual customers can. All customers thus benefit from higher security and compliance.
- The proven track record of AWS: According to the 2019 Gartner Magic Quadrant, AWS has been evaluated as the Leader in cloud IaaS for the ninth year in a row, with the highest score on both axes of measurement, Ability to Execute and Completeness of Vision.
Serverless architecture is a newer deployment paradigm in which application components are deployed on compute infrastructure only at the time of usage. The compute infrastructure is owned and operated by the cloud provider, removing the entire burden of server procurement, configuration and management from the customer.
Over the last couple of years, serverless architecture has gained significant ground, and there is an increasing trend toward deploying applications in a serverless fashion.
The data lake is no exception to this pattern. In fact, the intrinsic nature of a data lake gives it various unique advantages when deployed on a serverless architecture.
The advantages include:
- Cheap storage: A data lake requires a massive amount of storage with extremely high availability and durability and low latency. A serverless storage service like S3 is ideally suited to this due to its low cost.
- Just-in-time large-scale compute: The high volume of data needs huge compute power for regular as well as experimental processing. A fixed server infrastructure of such capacity would mean a lot of idle time and hence wastage. AWS offers several kinds of serverless compute services – Glue, Lambda and Fargate – for different use cases, enabling the underlying compute platform to be shared, with a consequently lower cost of usage.
- Near-zero operational maintenance: Since the underlying infrastructure is completely owned and operated by AWS, there is almost no maintenance burden on the customer beyond operational monitoring of the service, despite the massive scale of data and processing.
- DR and BCP available by default: Again, the customer does not need to plan separately for high availability. Disaster recovery (DR) and a business continuity plan (BCP) are present by default in the serverless platforms. Without serverless architecture, planning and executing these would be a serious concern due to the scale of a data lake.
- Plethora of serverless components on AWS: The availability of tools plays a key role in the adoption of serverless architecture, and AWS provides multiple choices of high-quality serverless components for every stage of a data lake.
- Apart from the compute platforms mentioned above, AWS provides Kinesis Firehose and Kinesis Analytics for real-time data processing, Athena for querying, API Gateway for API management and so on.
- Beyond object storage in S3, it offers DynamoDB for NoSQL data, Aurora Serverless for RDBMS and EFS for file storage.
- For operations and governance, it provides CloudWatch, CloudTrail, SNS, SQS, etc.
- For security and compliance, there are tools such as IAM, Secrets Manager, Macie and many more.
All such tools are essential for developing and managing a data lake, and if they were not serverless, they would have cost a lot due to high idle time.
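A common glue point among these components is an event-driven function that reacts when a raw object lands in the lake. The following is a minimal sketch, not production code: it assumes a Lambda-style handler wired to S3 "object created" notifications, and the bucket and key names are invented. It runs locally against a fake event, showing only the event-parsing step a real pipeline would build on (for example, to catalog the object or trigger a Glue job).

```python
import urllib.parse

# Sketch of a Lambda-style handler for a serverless data lake (assumption:
# the function is subscribed to S3 "object created" notifications). It
# extracts the (bucket, key) of each newly landed raw object.

def handler(event, context=None):
    """Return the (bucket, key) pairs of newly landed raw objects."""
    landed = []
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications,
        # so decode them before use.
        key = urllib.parse.unquote_plus(rec["s3"]["object"]["key"])
        landed.append((bucket, key))
    return landed

# A minimal fake S3 notification for local testing (names are made up).
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-lake"},
                "object": {"key": "logs/2019/app+log.json"}}}
    ]
}
result = handler(fake_event)
```

Because the platform invokes such a function only when an object actually arrives, the customer pays nothing while the pipeline is idle, which is precisely the cost advantage described above.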
A modern organization has to use every advantage at its disposal to survive and thrive, and no organization today can ignore the wealth of data at its command. A data lake gives unmatched flexibility for unlocking the analytics potential of that data, and creating a serverless data lake on AWS is the optimal path on that journey.