A data warehouse (DW) is a business intelligence system that analyzes historical data gathered from different sources and quickly provides organizations with insightful information from those data. Conventional DWs are commonly hosted in a relational database designed for reporting and data analysis. The use cases supported by a DW apply across industries. For example:
- Analysis of sales data across different regions for multiple products/offerings/services
- Analysis of advertisement effectiveness across multiple channels with respect to demography/personal profile
- Analysis of social trends and sentiment analysis about product/offerings/service
- Measurement of service/product quality, operational efficiency, and financial performance of an organization across different verticals/lines of service/products
However, a conventional DW comes with a well-known set of challenges: it requires upfront investment in costly software licenses and large infrastructure. This is where cloud providers come in. They are game changers because they transform CapEx into an OpEx model, which makes large DW systems affordable and accessible, increases development speed, and drastically reduces the burden of owning and managing infrastructure.
For instance, Amazon Web Services (AWS) provides Redshift, a platform as a service (PaaS) that can host up to two petabytes of compressed data with a reasonable pricing model. Google Cloud Platform (GCP) offers BigQuery, a serverless, fully managed, petabyte-scale DW platform. Both platforms are based on the same principle of hosting compressed data in columnar storage, which lets them read large volumes of data in very little time.
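The columnar advantage can be illustrated with a toy sketch (all table and column names below are invented for the example): an analytic aggregate over one column only has to touch that column's values, whereas a row-oriented layout walks every field of every record.

```python
# Toy illustration of why columnar storage speeds up analytics.
# Hypothetical 1000-row "orders" table with four fields per row.
rows = [
    {"order_id": i, "region": "EMEA", "amount": 10.0 * i, "notes": "x" * 50}
    for i in range(1000)
]

# Row-oriented layout: summing `amount` still walks entire row records.
row_store = rows
row_values_touched = sum(len(r) for r in row_store)  # every field of every row

# Column-oriented layout: only the `amount` column is scanned.
col_store = {key: [r[key] for r in rows] for key in rows[0]}
col_values_touched = len(col_store["amount"])

total_amount = sum(col_store["amount"])
print(row_values_touched, col_values_touched)  # 4000 vs 1000 values touched
```

With four fields per row, the columnar scan touches a quarter of the values; real DW engines compound this with per-column compression, which is why both Redshift and BigQuery store data this way.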
A comparative analysis of Redshift and BigQuery:
- Performance benchmarks suggest that both tools take a similar amount of time to load data, whereas query performance is somewhat better in Redshift than in BigQuery
- BigQuery provides a truly serverless DW architecture in which compute is allocated on the fly across Google's infrastructure based on each query's requirements. In contrast, Redshift cluster nodes must be in place to hold and query the data
- Both Redshift and BigQuery can encrypt data with industry-standard encryption methods
- Both Redshift and BigQuery comply with audit standards such as SOC 1/2/3
- BigQuery's cost model is more complex: on-demand pricing is charged per query, based on the data the query scans. If a developer fires a careless query that reads a huge amount of data, the cost of that single query can be very high. Redshift has a predefined, predictable cost model.
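A back-of-the-envelope sketch makes the difference between the two models concrete. The rates below are illustrative assumptions, not current list prices (BigQuery's on-demand tier has historically been on the order of a few dollars per TB scanned; Redshift bills per node-hour regardless of query volume):

```python
# Illustrative rates only -- check the providers' pricing pages for real numbers.
BQ_PRICE_PER_TB_SCANNED = 5.00   # assumed on-demand rate, USD per TB scanned
REDSHIFT_NODE_HOUR = 0.25        # assumed per-node hourly rate, USD

def bigquery_query_cost(bytes_scanned: int) -> float:
    """On-demand cost of a single query, driven entirely by data scanned."""
    return bytes_scanned / 1024**4 * BQ_PRICE_PER_TB_SCANNED

def redshift_monthly_cost(nodes: int, hours: int = 730) -> float:
    """Flat cluster cost: the nodes are billed whether or not they run queries."""
    return nodes * hours * REDSHIFT_NODE_HOUR

# One careless SELECT * over a 50 TB table is billed for all 50 TB scanned:
careless = bigquery_query_cost(50 * 1024**4)   # 250.0 USD for a single query
# ...comparable to running a 2-node cluster for an entire month:
cluster = redshift_monthly_cost(2)             # 365.0 USD
```

Under these assumed rates, one bad query approaches a month of cluster cost, which is exactly why BigQuery's per-query billing is harder to predict than Redshift's flat model.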
- Redshift supports JDBC/ODBC APIs over a PostgreSQL-compatible query language, so it is easier to integrate with existing clients that already consume data from a DW. BigQuery offers client libraries that use JSON over HTTP.
- The AWS Marketplace is mature and provides interfaces for integrating with third-party software; GCP is not yet as mature in that respect
So the key takeaways…
- Ingesting a large amount of data into a DW system takes considerable time, bandwidth, and cost. So unless a DW on a different cloud provides a significant benefit, the DW should live with the same cloud provider that hosts the source data systems
- For low-frequency queries executed over a high volume of data, BigQuery is more cost effective. However, for typical workloads Redshift has the cost advantage.