le Minh Nguyen

Reputation: 291

SQL database to BigQuery or SQL database to GCS to BigQuery

In the book Data Engineering with Google Cloud Platform by Adi Wijaya, the author never loads data from a SQL database straight into BigQuery. He always exports it to Google Cloud Storage (GCS) first, uses GCS as a staging environment, and only then loads the data into BigQuery.

What are the advantages of going through the GCS step instead of loading straight into BigQuery? In which cases would you load data directly from a SQL database into BigQuery?

Upvotes: 2

Views: 979

Answers (2)

Jamil Noyda

Reputation: 3649

Staging the data in GCS as multiple CSV files and then loading it into BigQuery is faster, because the load runs entirely within Google Cloud infrastructure instead of going back and forth across different networks. Also, according to the documentation, BigQuery “does not guarantee data consistency for external data sources,” so following this recommendation works in our favor.

Other tools that integrate with BigQuery follow the same pattern. Take Skyvia, for example. Its documentation says that Bulk Import needs a GCS bucket and that Skyvia takes care of writing the CSV files, so the user does not have to. From the docs: “Skyvia writes data into multiple temporary CSV files, upload them to Google Cloud Storage and then tells Google BigQuery to import data from these CSV files.”

So, this is a standard thing that Google recommends for toolmakers and integrators.
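To make the "multiple CSV files" staging step concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the method from the book or from Skyvia: an in-memory SQLite database stands in for the source SQL database, and the function name, the `orders` table, and the chunk size are all made up for the example. Uploading the resulting files to a GCS bucket is left out.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_table_to_csv_chunks(conn, table, out_dir, rows_per_file=2):
    """Write the rows of `table` to numbered CSV files and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The table name is interpolated directly, so pass trusted identifiers only.
    cursor = conn.execute(f"SELECT * FROM {table}")
    header = [column[0] for column in cursor.description]
    paths = []
    while True:
        rows = cursor.fetchmany(rows_per_file)
        if not rows:
            break
        path = out_dir / f"{table}-{len(paths):05d}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)  # every chunk carries its own header row
            writer.writerows(rows)
        paths.append(path)
    return paths


# Assumption: SQLite stands in for the real source database (e.g. Cloud SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0), (3, 7.25)]
)

staging_dir = tempfile.mkdtemp()
files = export_table_to_csv_chunks(conn, "orders", staging_dir, rows_per_file=2)
print([f.name for f in files])
```

Because every chunk is an independent file, a failed upload only needs that one file to be retried, and BigQuery can later pick all of them up with a single wildcard URI.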

Upvotes: 0

Sarah Remo

Reputation: 719

As mentioned in this post, BigQuery doesn't support the SQL format, so you cannot load a Cloud SQL export in that format directly into BigQuery. You can follow either of the procedures below:

  1. Use a BigQuery Cloud SQL federated query to import data into BigQuery directly from Cloud SQL.
  2. Based on this documentation, first generate CSV or JSON files from the Cloud SQL database, persist those files to Cloud Storage, and then load them into BigQuery.
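As a rough sketch of what the two procedures look like in BigQuery SQL, the helpers below just build the statement text. `EXTERNAL_QUERY` and `LOAD DATA ... FROM FILES` are BigQuery SQL constructs. The helper names, the connection ID, the bucket, and the table names are placeholders invented for this example, and the quoting is deliberately simple, so treat it as an illustration rather than production code.

```python
def federated_import_sql(destination_table, connection_id, source_query):
    """Procedure 1: copy the result of a Cloud SQL query into a BigQuery table."""
    # Escape backslashes and single quotes so the inner query stays one literal.
    escaped = source_query.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"CREATE OR REPLACE TABLE {destination_table} AS\n"
        f"SELECT * FROM EXTERNAL_QUERY('{connection_id}', '{escaped}')"
    )


def load_from_gcs_sql(destination_table, uris):
    """Procedure 2: load CSV files already staged in Cloud Storage."""
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"LOAD DATA INTO {destination_table}\n"
        f"FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = [{uri_list}])"
    )


# All identifiers below are hypothetical placeholders.
print(federated_import_sql(
    "mydataset.orders",
    "my-project.us.my-cloudsql-connection",
    "SELECT id, amount FROM orders",
))
print(load_from_gcs_sql(
    "mydataset.orders",
    ["gs://my-staging-bucket/orders-*.csv"],
))
```

Either statement could then be run in the BigQuery console or submitted through a client library. The first reads from Cloud SQL at query time, while the second only touches files that already sit in the bucket.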

The advantages when loading data from Cloud SQL to Cloud Storage to BigQuery are:

  • Cloud Storage provides services such as resumable uploads. If you combine the job and the data in a single step, you need to be more careful about managing failed jobs and dealing with transient issues yourself.
  • According to this documentation, by using Cloud Storage you can take advantage of long-term storage:

When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.

  • As mentioned by @John Hanley, and I agree, loading data into BigQuery by way of Cloud Storage is faster, and it gives you a consistent copy or backup that can be recovered in the event of a primary data failure.
  • The BigQuery table can be deleted when not in use and re-imported when needed. The load is also less likely to fail when creating a table.

Additionally, storing data in BigQuery costs more than storing it in Cloud Storage. Note that you are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket.

Suggesting the best strategy would require more information in your question, because the answer depends on your use case. More information on loading data can be found in the BigQuery documentation.

Upvotes: 5
