Roshan Fernando

Reputation: 525

BigQuery data warehouse design?

In a typical HDFS-based data warehouse environment, I have seen the data staged and transformed through several zones, as shown below. I am trying to design a similar system on Google Cloud Platform where I can perform all of these transformations. Please help.

HDFS:: Landing Zone -> Stage 1 Zone -> Stage 2 Zone

Landing Zone - holds the raw data.
Stage 1 Zone - the raw data from the Landing Zone is transformed, converted to a different data format and/or denormalized, and stored in Stage 1.
Stage 2 Zone - data from Stage 1 is applied as updates to a transaction table, say in HBase. If it is only time-period data, it stays in an HDFS-based Hive table.
Reporting then happens from Stage 2. (There could also be multiple zones in between for transformations.)

My thought process for implementing this in Google Cloud:

Landing (Google Cloud Storage) -> Stage 1 (BigQuery - hosts all time-based data) -> Stage 2 (BigQuery for time-based data / Bigtable for transactional data keyed by row key)
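To make the Landing -> Stage 1 step concrete, here is a minimal sketch of what I have in mind, using the BigQuery Python client to load newline-delimited JSON from Cloud Storage (the project, bucket, dataset and table names are placeholders I made up):

    from google.cloud import bigquery

    # Placeholder project/dataset/bucket names - replace with real ones.
    client = bigquery.Client(project="my-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # or pass an explicit schema instead
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Load newline-delimited JSON files from the landing bucket into a Stage 1 table.
    load_job = client.load_table_from_uri(
        "gs://my-landing-bucket/raw/*.json",
        "my-project.stage1.events",
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes

    table = client.get_table("my-project.stage1.events")
    print(f"Stage 1 table now has {table.num_rows} rows")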

My questions are below:

a) Does this implementation look realistic? I am planning to use Dataflow to read and load data between these zones. What would be a better design, if anyone has built a warehouse like this before?

b) How effective is it to use Dataflow to read from BigQuery and then update Bigtable? I have seen a Dataflow connector for Bigtable updates here.

c) Can JSON be used as the primary data format, since BigQuery supports it?

Upvotes: 0

Views: 671

Answers (1)

F10

Reputation: 2883

  1. There is a solution that may fit your scenario: load the data to Cloud Storage, read it and do the transformations with Dataflow, then either write the output back to Cloud Storage to be loaded into BigQuery afterwards, and/or write directly to Bigtable with the Dataflow connector you mentioned (a rough sketch follows after this list).
  2. As mentioned above, you can send your transformed data to both databases from Dataflow. Keep in mind that while both BigQuery and Bigtable are good for analytics, Bigtable offers low-latency reads and writes, whereas BigQuery has higher latency because it runs query jobs to gather the data.
  3. Yes, that is a good idea, since you can load newline-delimited JSON from Cloud Storage into BigQuery directly.
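As a rough sketch of the pipeline from point 1 (not production code; every project, bucket, instance and table name below is a placeholder), a Dataflow job written with the Apache Beam Python SDK could read the raw JSON from Cloud Storage, parse it, and write the transformed records to both BigQuery and Bigtable:

    import json
    import datetime

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable import row as bt_row


    def parse_record(line):
        """Parse one newline-delimited JSON record into a dict."""
        return json.loads(line)


    def to_bigtable_row(record):
        """Build a Bigtable DirectRow keyed on the record's transaction id."""
        direct_row = bt_row.DirectRow(row_key=record["txn_id"].encode("utf-8"))
        direct_row.set_cell(
            "cf1",
            b"payload",
            json.dumps(record).encode("utf-8"),
            timestamp=datetime.datetime.utcnow(),
        )
        return direct_row


    def run():
        # Placeholder options - replace with your own project/region/bucket.
        options = PipelineOptions(
            runner="DataflowRunner",
            project="my-project",
            region="us-central1",
            temp_location="gs://my-landing-bucket/tmp",
        )

        with beam.Pipeline(options=options) as p:
            records = (
                p
                | "ReadLanding" >> beam.io.ReadFromText("gs://my-landing-bucket/raw/*.json")
                | "Parse" >> beam.Map(parse_record)
            )

            # Time-based data goes to BigQuery.
            records | "WriteBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:stage1.events",
                schema="txn_id:STRING,event_ts:TIMESTAMP,payload:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )

            # Transactional data keyed by txn_id goes to Bigtable.
            (
                records
                | "ToBigtableRow" >> beam.Map(to_bigtable_row)
                | "WriteBigtable" >> WriteToBigTable(
                    project_id="my-project",
                    instance_id="my-instance",
                    table_id="transactions",
                )
            )


    if __name__ == "__main__":
        run()

Running it with DataflowRunner executes it on Dataflow; switching to DirectRunner lets you test the same pipeline locally before deploying.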

Upvotes: 2
