Unified data management with Delta Lake and Google Cloud

A global consumer goods company was facing acute business challenges such as delayed product recommendations, deferred billing, and SLA non-compliance, despite running a big data cluster of more than 1,000 nodes. The firm stored 23 petabytes of data, with transaction tables as large as 90 TB and billions of billing transactions that had to be traced whenever an order was amended or cancelled. The 18-step discovery process, which required joining more than 20 tables, was becoming very inefficient. Furthermore, each business transaction was a multi-record transaction requiring time travel and backpropagation, and hence complex match-and-merge logic.

This is a common challenge for global organizations, and one where traditional data lakes fall short. What comes to the rescue in such a scenario is Delta Lake integrated with a highly scalable, well-tuned cloud solution.

Architecting a data pipeline with Delta Lake and Google Cloud

Delta Lake, an open source storage layer, brings ACID transactions to the data lake, along with time travel, history tracking, metadata management, and governance. This collection of libraries and routines can use Amazon’s Simple Storage Service (S3) or the Hadoop Distributed File System (HDFS) as its storage layer. Google Cloud Platform (GCP) can be used for the same purpose. Figure 1 depicts how GCP and Delta Lake can be leveraged to build a unified data processing and consumption layer.

Figure 1: A unified data processing and consumption solution built on Google Cloud and Delta Lake

Creating a single data narrative

The layers of the unified data processing solution (depicted in Figure 1) are:

Data Sources
Unified Data Platform Setup
Unified Data Ingestion & Processing Layer
Unified Data Layer
Unified Data Consumption Layer

Data Sources: The source data for a typical billing schema consists of tables such as

Transaction Tables: Purchase Line Items, Cancellation Line Items etc.
High Volume Masters: Account Master, Customer Billing Profiles etc.
Reference Tables: Geo Code, Cancellation Code, Product, Vendors etc.

Unified Data Platform Setup: Delta Lake relies on the Spark framework for its processing. Google’s Dataproc comes with a built-in Spark installation. It also has built-in Google Cloud Storage (GCS) connectors that reduce deployment time considerably. An appropriate version of the Dataproc instance should be deployed, and the appropriate dependencies should be added to the pom file.
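As a rough illustration of the setup step, the sketch below collects the Spark properties that Delta Lake documents for enabling its SQL extension and catalog. It is written in Python rather than as pom entries. The package coordinate and version, the application name, and the helper names delta_spark_conf and build_session are assumptions for illustration; the Delta release must match the Spark version of the chosen Dataproc image.

```python
# Sketch only: Spark properties for enabling Delta Lake on a Spark cluster
# such as Dataproc. The package version is an assumed placeholder; choose the
# Delta release that matches the Spark version of your Dataproc image.
DELTA_PACKAGE = "io.delta:delta-core_2.12:2.4.0"  # assumption, adjust as needed


def delta_spark_conf(delta_package: str = DELTA_PACKAGE) -> dict:
    """Return the Spark properties used to enable Delta's SQL extension and catalog."""
    return {
        "spark.jars.packages": delta_package,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name: str = "unified-data-platform"):
    """Create a SparkSession with the Delta settings applied (requires pyspark)."""
    # Imported lazily so that delta_spark_conf() has no third-party dependency.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On Dataproc the same properties could instead be supplied as cluster or job properties, which keeps them out of application code.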

Unified Data Ingestion & Processing Layer: This layer can be developed using tools such as Google’s Dataflow or Delta-enabled ETL tools like Talend.

Initial Conversion: The initial load or Delta conversion of high-volume transactional tables such as Purchase Line Items and Cancellation Line Items can be sped up by skipping statistics generation, which limits the metadata produced. Other tables can be loaded or converted with or without statistics, as needed. The converted tables can be optimized later.
Incremental Batch/Streaming Loads: Delta Lake can manage both batch and streaming data seamlessly. Incremental data can then be ingested according to the business logic.
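The two load patterns above can be sketched as Delta SQL statements. This is a minimal illustration, not the project's actual pipeline: the helper names, the bucket path, the partition column, and the key columns are all hypothetical. The NO STATISTICS clause of CONVERT TO DELTA depends on the Delta version in use, so its availability should be checked against the deployed release.

```python
def convert_to_delta_sql(parquet_path: str, partition_spec: str = "",
                         collect_stats: bool = True) -> str:
    """Build a CONVERT TO DELTA statement for an existing Parquet directory."""
    sql = f"CONVERT TO DELTA parquet.`{parquet_path}`"
    if not collect_stats:
        # Skip per-file statistics to keep the initial conversion light.
        sql += " NO STATISTICS"
    if partition_spec:
        sql += f" PARTITIONED BY ({partition_spec})"
    return sql


def merge_increment_sql(target: str, source: str, key_columns: list) -> str:
    """Build a MERGE that upserts an incremental batch into a Delta table."""
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} AS t USING {source} AS s ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names, for illustration only.
initial_load = convert_to_delta_sql(
    "gs://example-billing-bucket/purchase_line_items",
    partition_spec="transaction_date DATE",
    collect_stats=False,
)
incremental_load = merge_increment_sql(
    "purchase_line_items", "purchase_line_items_increment",
    ["order_code", "line_item_id"],
)
```

Each string would be passed to spark.sql(...) on a Delta-enabled session. Real match-and-merge logic for multi-record transactions would likely need extra WHEN clauses and conditions beyond this plain upsert.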

Unified Data Layer: GCS buckets can be configured as the storage layer for Delta Lake. The core business data is stored in Parquet format, and the transaction logs are maintained in JSON format. To optimize query performance at this layer, data can be co-located using transaction-date-based optimization and Z-ordering on key fields such as order codes and cancellation codes. This layer provides full ACID compliance along with highly optimized querying and time travel on the data.
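A minimal sketch of the optimization and time travel described here, expressed as Delta SQL built in Python. The table name, column names, date, and version number are hypothetical. The sketch assumes the table is partitioned by transaction date, since the WHERE clause of OPTIMIZE filters on partition columns, and assumes a Delta release that supports OPTIMIZE with ZORDER BY and the VERSION AS OF syntax.

```python
def optimize_zorder_sql(table: str, zorder_columns: list,
                        partition_filter: str = "") -> str:
    """Build an OPTIMIZE statement that compacts files and Z-orders them."""
    sql = f"OPTIMIZE {table}"
    if partition_filter:
        # Restrict compaction to selected partitions, e.g. recent dates.
        sql += f" WHERE {partition_filter}"
    sql += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return sql


def time_travel_sql(table: str, version: int) -> str:
    """Build a query that reads the table as of an earlier version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


# Hypothetical table and columns, for illustration only.
optimize_stmt = optimize_zorder_sql(
    "purchase_line_items",
    ["order_code", "cancellation_code"],
    partition_filter="transaction_date >= '2021-01-01'",
)
snapshot_query = time_travel_sql("purchase_line_items", 12)
```

Running DESCRIBE HISTORY on the table lists the versions available for time travel, which helps when tracing how an order looked before an amendment or cancellation.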

Unified Data Consumption Layer: Based on the nature of the data and access requirements, Google’s BigQuery and Bigtable can be leveraged as the storage layer.

High-volume, low-latency tables with fixed access patterns (reference tables such as Customer Billing Profile) can be stored in Bigtable, while others can be stored in BigQuery.
Other BigQuery best practices such as partitioning/clustering, query caching, join optimization, and lifecycle management of Delta data and GCS buckets can be implemented to further optimize storage and access of the data.
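To make the partitioning/clustering practice concrete, the sketch below builds a BigQuery DDL statement for a table partitioned on a DATE column and clustered on a key field. The dataset, table, and column names are hypothetical, and the helper is an illustration rather than part of any Google client library.

```python
def bigquery_partitioned_table_ddl(table: str, columns: dict,
                                   partition_column: str,
                                   cluster_columns: list) -> str:
    """Build a CREATE TABLE statement with partitioning and clustering.

    Assumes partition_column is a DATE column, so it can be used directly
    in the PARTITION BY clause.
    """
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY {partition_column}\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical schema, for illustration only.
ddl = bigquery_partitioned_table_ddl(
    "billing.purchase_line_items",
    {"order_code": "STRING", "transaction_date": "DATE", "amount": "NUMERIC"},
    partition_column="transaction_date",
    cluster_columns=["order_code"],
)
print(ddl)
```

Queries that filter on the partition column scan only the matching partitions, and clustering on the order code keeps related rows close together within each partition.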

Effective data management

Organizations need actionable data. To fit the bill, data must provide a single version of the truth along with full governance capability. This enables decision makers to trust the data and make insightful decisions. Delta Lake along with GCP provides an effective way to manage and control data.

Industry: Data, Analytics & AI

About the Authors

Rahul Sarda
Distinguished Member of Technical Staff at Wipro.

He has over 20 years of experience with deep technology and functional domain expertise. He has helped organizations across industry verticals develop value propositions for faster time to insights.

Ajinkya Chavan
Google Certified Professional Data Engineer.

He has over 18 years of IT experience in analytics, in architectural and consulting positions. His areas of expertise include Big Data, Cloud, DWH, ETL, and Data Migration.

Vijay Kardile

He has over 17 years of enterprise IT experience, including consulting, application and technology development spanning multiple industry segments and diverse technology areas. His areas of expertise include Big Data, Cloud, DevOps and building innovative solutions using Open Source technologies.
