Lovell Mathews

Reputation: 51

Reading from an external table vs. loading data and reading from it in BigQuery

I need to get data (CSV format) from GCS into BigQuery and then perform ETL on it to produce results. The format of the CSV files that come over is not fixed and could subtly change with every file. Would it be better to create temporary external tables to read the data directly from GCS and then process it, or would it be better to load the data into a staging table in BigQuery and process it from there? I am trying to understand which is the better design in terms of execution efficiency. Is there a drawback with either of the approaches?
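For context, here is roughly what the two options look like with the google-cloud-bigquery Python client. This is only a sketch; the bucket, dataset, and table names below are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    uri = "gs://my-bucket/incoming/latest.csv"  # placeholder path

    # Option 1: temporary external table -- query the CSV in place on GCS.
    ext = bigquery.ExternalConfig("CSV")
    ext.source_uris = [uri]
    ext.autodetect = True  # schema may drift between files
    job_config = bigquery.QueryJobConfig(table_definitions={"raw_ext": ext})
    rows = client.query("SELECT * FROM raw_ext", job_config=job_config).result()

    # Option 2: load into a native staging table first, then run the ETL SQL on it.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )
    client.load_table_from_uri(
        uri, "my_project.my_dataset.staging_table", job_config=load_config
    ).result()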

Upvotes: 0

Views: 478

Answers (2)

Parth Mehta

Reputation: 1917

If I understood correctly, you want to deal with exceptions caused by bad entries without interrupting the process.

If that's the case, you want to use Cloud Dataflow with a ParDo to handle the bad entries and push them into Cloud Pub/Sub (or an equivalent) so they can be dealt with by a separate system.
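A minimal sketch of that pattern, assuming a Python/Apache Beam pipeline: the bucket paths, table name, and schema are made up, and bad rows are written to a dead-letter file on GCS rather than Pub/Sub to keep the example batch-only.

    import apache_beam as beam
    from apache_beam import pvalue


    class ParseRow(beam.DoFn):
        """Parse one CSV line; route anything that fails validation to a 'bad' output."""

        def process(self, line):
            try:
                fields = line.split(",")
                yield {"id": int(fields[0]), "value": fields[1]}
            except (ValueError, IndexError):
                # Bad entry: emit on a tagged side output instead of failing the job.
                yield pvalue.TaggedOutput("bad", line)


    with beam.Pipeline() as p:
        parsed = (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/incoming/*.csv")
            | "Parse" >> beam.ParDo(ParseRow()).with_outputs("bad", main="good")
        )
        parsed.good | "WriteGood" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.clean_rows", schema="id:INTEGER,value:STRING"
        )
        parsed.bad | "WriteBad" >> beam.io.WriteToText("gs://my-bucket/dead-letter/rows")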

See the following URL for further information.

https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow

Hope this helps.

Upvotes: 1

brettrospective

Reputation: 47

Google Cloud Platform has a service called Cloud Composer. This is GCP's managed version of Apache Airflow, which is software for managing data pipelines and workflows. Being a GCP product, Composer has built-in operators for working with GCS and BigQuery. I would recommend you build your pipeline in Composer.

https://cloud.google.com/composer/

We use Composer with GCS and BigQuery to manage the entire ETL process, along these lines (see the sketch below):

Composer >> Extract raw file from service >> Store raw file to GCS
Composer >> Extract raw file from GCS >> Transform raw file >> Store transformed file to GCS >> Store transformed file to BigQuery
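As a rough sketch of the second leg of that flow, assuming Airflow 2.x with the Google provider installed: the bucket, object paths, and table are placeholders, and the transform step is left as a stub.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator


    def transform_raw_file(**context):
        # Stub: read the raw CSV from GCS, normalise the drifting columns,
        # and write the cleaned file back to GCS.
        pass


    with DAG(
        dag_id="gcs_to_bq_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        transform = PythonOperator(
            task_id="transform_raw_file",
            python_callable=transform_raw_file,
        )

        load_to_bq = GCSToBigQueryOperator(
            task_id="load_transformed_to_bq",
            bucket="my-bucket",
            source_objects=["transformed/*.csv"],
            destination_project_dataset_table="my_project.my_dataset.staging_table",
            source_format="CSV",
            autodetect=True,
            write_disposition="WRITE_TRUNCATE",
        )

        transform >> load_to_bq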

Composer has many additional pipeline management features that you can take advantage of as your ETLs get more complex.

Upvotes: 1
