In a typical HDFS environment for a data warehouse, I have seen data staged and transformed across several zones, as described below. I am trying to design a system on Google Cloud Platform where I can perform all of these transformations. Please help.
HDFS: Landing Zone -> Stage 1 Zone -> Stage 2 Zone

Landing Zone - holds the raw data
Stage 1 Zone - the raw data from the Landing Zone is transformed, converted to a different data format and/or denormalized, and stored in Stage 1
Stage 2 Zone - data from Stage 1 is updated into a transaction table, say HBase; if it is just time-period data, it stays in an HDFS-based Hive table

Reporting then happens from Stage 2. (There could also be multiple zones in between for transformation.)
My thought process for implementing this on Google Cloud:
Landing (Google Cloud Storage) -> Stage 1 (BigQuery - hosts all time-based data) -> Stage 2 (BigQuery for time-based data / Bigtable for transactional data, keyed by row key)
My questions are below:
a) Does this implementation look realistic? I am planning to use Dataflow to read and load data between these zones. What would be a better design, if anyone has implemented such a warehouse before?
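To make a) more concrete, the Landing -> Stage 1 hop I have in mind looks roughly like this (a minimal sketch with the Apache Beam Python SDK; the bucket, table, schema, and field names are made-up placeholders):

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder locations - replace with real bucket/dataset names.
    LANDING_FILES = 'gs://my-landing-bucket/events/*.json'
    STAGE1_TABLE = 'my-project:stage1.events'

    def transform(record):
        """Example denormalization: flatten a nested 'user' field."""
        record['user_id'] = record.pop('user', {}).get('id')
        return record

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ReadLanding' >> beam.io.ReadFromText(LANDING_FILES)
         | 'ParseJson' >> beam.Map(json.loads)
         | 'Transform' >> beam.Map(transform)
         | 'WriteStage1' >> beam.io.WriteToBigQuery(
             STAGE1_TABLE,
             schema='user_id:STRING,event_type:STRING,event_ts:TIMESTAMP',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))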
b) How effective is it to use Dataflow to read from BigQuery and then update Bigtable? I have seen a Dataflow connector for Bigtable writes here.
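For b), what I am picturing is roughly the following (again a sketch; the project, instance, table, column family, and row-key scheme are placeholders, and I am assuming one DirectRow mutation per BigQuery row):

    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigtable
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud.bigtable.row import DirectRow

    def to_bigtable_row(record):
        """Turn a BigQuery row dict into a Bigtable DirectRow mutation."""
        row = DirectRow(row_key=record['transaction_id'].encode('utf-8'))
        row.set_cell('cf1', b'amount', str(record['amount']).encode('utf-8'))
        row.set_cell('cf1', b'status', record['status'].encode('utf-8'))
        return row

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ReadStage1' >> beam.io.ReadFromBigQuery(
             query='SELECT transaction_id, amount, status '
                   'FROM `my-project.stage1.transactions`',
             use_standard_sql=True)
         | 'ToBigtableRows' >> beam.Map(to_bigtable_row)
         | 'WriteStage2' >> WriteToBigtable(
             project_id='my-project',
             instance_id='my-instance',
             table_id='transactions'))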
c) Can JSON be used as the primary data format, since BigQuery supports it?
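My assumption for c) is that the landing files would be newline-delimited JSON, since BigQuery load jobs expect one JSON object per line rather than a single JSON array. A sketch of the load step with the google-cloud-bigquery client (names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # infer the schema from the JSON keys
    )

    # Load newline-delimited JSON straight from the landing bucket.
    load_job = client.load_table_from_uri(
        'gs://my-landing-bucket/events/*.json',
        'my-project.stage1.events_raw',
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish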