Creating a single data narrative
The layers of the unified data processing solution (depicted in Figure 1) are:
- Data Sources
- The Unified Data Platform Setup
- Unified Data Ingestion & Processing Layer
- Unified Data Layer
- Unified Data Consumption Layer
Data Sources: The source data for a typical billing schema consists of tables such as:
- Transaction Tables: Purchase Line Items, Cancellation Line Items, etc.
- High Volume Masters: Account Master, Customer Billing Profiles, etc.
- Reference Tables: Geo Code, Cancellation Code, Product, Vendors, etc.
Unified Data Platform Setup: Delta Lake uses the Apache Spark framework for its processing. Google’s Dataproc comes with Spark preinstalled and includes built-in Google Cloud Storage (GCS) connectors, which reduces deployment time considerably. Deploy a Dataproc version whose Spark release is compatible with the chosen Delta Lake release, and add the corresponding Delta Lake dependencies to the project's pom file.
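As an illustration of the setup step, the sketch below collects the Spark properties that are typically set to enable Delta Lake in a Spark session. The package coordinates and version are assumptions and must be matched to the Spark release on the Dataproc image in use; the function names and the application name are hypothetical.

```python
# Sketch only: the Delta package version below is an assumed example.

def delta_spark_conf(delta_package):
    """Return the Spark properties commonly used to enable Delta Lake."""
    return {
        # Same Maven coordinates that would be declared in the pom file.
        "spark.jars.packages": delta_package,
        # Registers Delta's SQL commands (CONVERT TO DELTA, OPTIMIZE, ...).
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        # Lets the default catalog resolve Delta tables.
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given properties (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; present on the cluster

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumed version; pick the release that matches the cluster's Spark version.
conf = delta_spark_conf("io.delta:delta-core_2.12:2.4.0")
```

On a Dataproc cluster, the session would then be created with build_session("billing-unified-data", conf).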
Unified Data Ingestion & Processing Layer: This layer can be developed using tools such as Google’s Dataflow or Delta-enabled ETL tools like Talend.
- Initial Conversion: The initial load/Delta conversion of high-volume transactional tables such as Purchase Line Items and Cancellation Line Items can be sped up by skipping statistics collection, which limits the metadata generated. Other tables can be loaded or converted with or without statistics, as needed. The converted tables can be optimized later.
- Incremental Batch/Streaming Loads: Delta Lake handles both batch and streaming data seamlessly. Incremental data can then be ingested according to the business logic.
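The two steps above can be sketched as Delta SQL statements built in Python. This is a minimal illustration, not a definitive implementation: the bucket path, the updates view, and the key columns are hypothetical, and the NO STATISTICS clause is assumed to be available in the Delta Lake release in use. Only the batch case is shown.

```python
# Sketch only: bucket paths, view names, and column names are hypothetical.

def convert_to_delta_sql(parquet_path, partition_spec="", collect_stats=True):
    """Build a CONVERT TO DELTA statement for an existing Parquet directory."""
    sql = f"CONVERT TO DELTA parquet.`{parquet_path}`"
    if not collect_stats:
        # Skips per-file statistics; assumed supported by the Delta release.
        sql += " NO STATISTICS"
    if partition_spec:
        sql += f" PARTITIONED BY ({partition_spec})"
    return sql


def merge_sql(target_path, source_view, key_columns):
    """Build a MERGE statement that upserts an incremental batch."""
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO delta.`{target_path}` AS t "
        f"USING {source_view} AS s ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


initial_load = convert_to_delta_sql(
    "gs://example-billing-raw/purchase_line_items",
    partition_spec="transaction_date DATE",
    collect_stats=False,
)
incremental_load = merge_sql(
    "gs://example-billing-raw/purchase_line_items",
    "purchase_line_items_updates",
    ["order_id", "line_number"],
)
# On the cluster, each statement would be run with spark.sql(statement).
```

Building the statements as strings keeps the business logic (keys, partition columns) in one place and makes it easy to review before anything runs against the tables.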
Unified Data Layer: GCS buckets can be configured as the storage layer for Delta Lake. The core business data is stored in Parquet format, and the transaction log is maintained in JSON format. To optimize query performance at this layer, related data can be co-located using transaction-date-based optimization and Z-ordering on key fields such as order codes and cancellation codes. This layer provides full ACID compliance along with optimized querying and time travel on the data.
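A minimal sketch of the optimization and time-travel steps follows. The table path and column names are hypothetical, and the date filter assumes the table is partitioned by transaction date, since the WHERE clause of OPTIMIZE applies to partition columns.

```python
# Sketch only: the table path and column names are hypothetical.

def optimize_sql(table_path, zorder_columns, partition_filter=""):
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    sql = f"OPTIMIZE delta.`{table_path}`"
    if partition_filter:
        # Restricts compaction to matching partitions (partition columns only).
        sql += f" WHERE {partition_filter}"
    sql += f" ZORDER BY ({', '.join(zorder_columns)})"
    return sql


def read_version(spark, table_path, version):
    """Read the table as of an earlier version (Delta time travel)."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
    )


statement = optimize_sql(
    "gs://example-billing-delta/purchase_line_items",
    ["order_code", "cancellation_code"],
    partition_filter="transaction_date >= '2023-01-01'",
)
# On the cluster: spark.sql(statement), then for example
# read_version(spark, "gs://example-billing-delta/purchase_line_items", 0)
```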
Unified Data Consumption Layer: Based on the nature of the data and the access requirements, Google’s BigQuery and Bigtable can be leveraged as the storage layer.
- High-volume, low-latency tables with fixed access patterns (reference tables such as Customer Billing Profile) can be stored in Bigtable, while the others can be stored in BigQuery.
- Other BigQuery best practices, such as partitioning/clustering, query caching, and join optimization, along with lifecycle management of Delta data and GCS buckets, can be implemented to further optimize storage and access of the data.
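To illustrate the partitioning/clustering practice, the sketch below builds a BigQuery DDL statement for a date-partitioned, clustered table. The dataset, table, and column names are hypothetical, and the partition column is assumed to be of type DATE.

```python
# Sketch only: dataset, table, and column names are hypothetical.

def partitioned_table_ddl(table, columns, partition_column, cluster_columns):
    """Build BigQuery DDL for a partitioned and clustered table."""
    column_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({column_defs}) "
        f"PARTITION BY {partition_column} "
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


ddl = partitioned_table_ddl(
    "billing.purchase_line_items",
    {
        "order_code": "STRING",
        "account_id": "STRING",
        "amount": "NUMERIC",
        "transaction_date": "DATE",
    },
    partition_column="transaction_date",
    cluster_columns=["order_code", "account_id"],
)
```

Partitioning on the transaction date lets queries that filter on a date range scan only the matching partitions, while clustering sorts the data within each partition on the columns most often used in filters.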
Effective data management
Organizations need actionable data. To fit the bill, data must provide a single version of the truth along with full governance capability. This enables decision makers to trust the data and make insightful decisions. Delta Lake, together with GCP, provides an effective way to manage and control data.