Designing ETL architecture for a cloud-native data warehouse on Google Cloud Platform

Architecture components

The components of this architecture include (following the architecture diagram left to right):
  • A task orchestrator built using Google App Engine Cron Service, a Google Cloud Pub/Sub control topic and Google Cloud Dataflow in streaming mode
  • Cloud Dataflow for importing bounded (batch) raw data from sources such as relational Google Cloud SQL databases (MySQL or PostgreSQL, via the JDBC connector) and files in Google Cloud Storage
  • Cloud Dataflow for importing unbounded (streaming) raw data from a Google Cloud Pub/Sub data ingestion topic
  • BigQuery for storing staging and final datasets
  • Additional ETL transformations enabled via Cloud Dataflow and embedded SQL statements
  • An interactive dashboard implemented via Google Sheets and connected to BigQuery
All of these components are fully managed services on GCP. With this architecture, there is no infrastructure for you to deploy, manage, secure or scale, and you pay only for what you use.
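To make the streaming import path more concrete, here is a minimal sketch using the Apache Beam Python SDK, which Cloud Dataflow can run. It reads messages from a Pub/Sub ingestion topic, parses them and appends them to a BigQuery staging table. The topic name, table name, message format (UTF-8 JSON) and the two-column schema are all assumptions made for illustration; they are not part of the architecture described above.

```python
import json
from typing import Any, Dict

# Hypothetical resource names, placeholders only.
INGEST_TOPIC = "projects/my-project/topics/data-ingestion"
STAGING_TABLE = "my-project:staging.events"


def parse_event(payload: bytes) -> Dict[str, Any]:
    """Decode one Pub/Sub message body (assumed UTF-8 JSON) into a row dict."""
    record = json.loads(payload.decode("utf-8"))
    return {
        "event_id": str(record["event_id"]),
        "amount": float(record["amount"]),
    }


def run_streaming_import(topic: str = INGEST_TOPIC, table: str = STAGING_TABLE) -> None:
    """Build and run the unbounded import pipeline (requires apache-beam[gcp])."""
    # Imported lazily so parse_event can be exercised without the Beam SDK.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadIngestTopic" >> beam.io.ReadFromPubSub(topic=topic)
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteStaging" >> beam.io.WriteToBigQuery(
                table,
                schema="event_id:STRING,amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```

Keeping the parsing step as a plain function means it can be unit tested on its own, and the same function could be reused in a bounded (batch) pipeline that reads files from Cloud Storage.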
Many customers migrating an on-premises data warehouse to Google Cloud Platform (GCP) need ETL solutions that automate four tasks: extracting data from operational databases, applying initial transformations, loading records into Google BigQuery staging tables and initiating aggregation calculations. These solutions typically have to meet the following requirements:
  • Support for a wide variety of operational data sources, including relational and NoSQL databases, files and streaming events
  • Ability to use DML statements in BigQuery to do secondary processing of data in staging tables
  • Ability to maximize resource utilization by automatically scaling up or down depending on the workload, and scaling, if need be, to millions of records per second
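As an illustration of the second requirement, the sketch below builds a BigQuery DML statement that aggregates rows from a staging table into a final table, and then submits it with the google-cloud-bigquery client. The table names and the columns event_ts and amount are hypothetical; a real warehouse would substitute its own schema and aggregation logic.

```python
from typing import Optional

# Hypothetical table names and columns (event_ts, amount), for illustration only.
STAGING_TABLE = "my-project.staging.events"
FINAL_TABLE = "my-project.warehouse.daily_totals"


def build_aggregation_dml(staging: str = STAGING_TABLE, final: str = FINAL_TABLE) -> str:
    """Return an INSERT ... SELECT statement that rolls staging rows up by day."""
    return (
        f"INSERT INTO `{final}` (event_date, total_amount)\n"
        f"SELECT DATE(event_ts) AS event_date, SUM(amount) AS total_amount\n"
        f"FROM `{staging}`\n"
        f"GROUP BY event_date"
    )


def run_secondary_processing(sql: Optional[str] = None) -> None:
    """Submit the DML statement to BigQuery (requires google-cloud-bigquery)."""
    # Imported lazily so the statement builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.query(sql or build_aggregation_dml())
    job.result()  # Block until the DML job finishes.
```

In the architecture described here, a statement like this would be issued once a staging load completes, for example as a step triggered by the task orchestrator.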

