Siddharth Verma

Reputation: 113

Migration from Google Cloud SQL to Google BigQuery

I need to migrate two of my tables from Google Cloud SQL to Google BigQuery.

The data is about 1 TB in size, covering roughly 5 months of stock market tick data.

I understand from the documentation that I can export the data as CSV and load it into BigQuery. I wish to create date-wise partitions in BigQuery, as suggested in the best practices. I also wish to run a robust verification after each date's data is migrated, to be sure that the data doesn't get corrupted during transfer.

My question is: how can I write code to loop over the dates and, within each iteration:
1. export that date's data from Google Cloud SQL
2. load it into Google BigQuery
3. verify that the data was not corrupted in transfer

Upvotes: 4

Views: 3161

Answers (3)

Jian He

Reputation: 510

You can use a BigQuery Cloud SQL federated query to copy a Cloud SQL table into BigQuery with a single BigQuery SQL statement. For example, the following SQL copies the MySQL table sales_20191002 to the BigQuery table demo.sales_20191002.

INSERT
   demo.sales_20191002 (column1, column2, ...)
SELECT
   *
FROM
   EXTERNAL_QUERY(
      "project.us.connection",
      "SELECT * FROM sales_20191002;");

EXTERNAL_QUERY("connection", "foreign SQL") executes the "foreign SQL" in the Cloud SQL database specified by "connection" and returns the result to BigQuery. The "foreign SQL" is written in the source database's dialect (MySQL or PostgreSQL).

Before running the SQL query above, you need to create a BigQuery connection that points to your Cloud SQL database.

Finally, you can validate data quality by comparing some high-level statistics of the two tables. For example, do the row counts match? Do the results of a GROUP BY aggregation match?
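If you want to script that check, here is a minimal sketch assuming the google-cloud-bigquery Python client and the connection/table names from the example above (replace them with your own):

from google.cloud import bigquery

client = bigquery.Client()

# Row count on the BigQuery side.
bq_rows = next(iter(client.query(
    "SELECT COUNT(*) AS cnt FROM demo.sales_20191002").result())).cnt

# Row count on the Cloud SQL side, fetched through the same federated connection.
mysql_rows = next(iter(client.query(
    'SELECT cnt FROM EXTERNAL_QUERY("project.us.connection", '
    '"SELECT COUNT(*) AS cnt FROM sales_20191002;")').result())).cnt

assert bq_rows == mysql_rows, "Row counts differ between Cloud SQL and BigQuery"

The same pattern works for checksums or GROUP BY aggregates: run the aggregation on both sides and compare the results.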

Upvotes: 3

Mar Cial R

Reputation: 946

Setting aside the fact that I would probably also choose Dataflow for a task like this, I've been working on a Python sample that does exactly what Siddharth asked for. I'd expect it to work well enough for small datasets.

It gathers a list of days from a source Cloud SQL table. Then, for each day, it:

1) exports the relevant Cloud SQL rows into a separate CSV file

2) loads each CSV into a separate BQ table, and

3) runs a simple verification query in both MySQL and BQ

if __name__ == '__main__':
  # The helper functions (mysql_distinct_days, do_extract_and_load, do_checks,
  # delete_tables) and the constants are defined in the full gist.
  days = mysql_distinct_days()[:NUM_DAYS_TO_TEST]

  do_extract_and_load(days)

  do_checks(days)

  if DELETE_TABLES:
    delete_tables(days)

I've shared a gist with the full source code and output in hopes that it'll prove helpful to somebody down the road. :-)
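As a rough idea of what the per-day export-and-load helpers might look like, here is a sketch. The table layout, column names, and connection settings below are placeholder assumptions rather than the gist's actual code, and it assumes the MySQLdb and google-cloud-bigquery clients:

import csv

import MySQLdb
from google.cloud import bigquery

MYSQL = dict(host='127.0.0.1', user='user', passwd='password', db='ticks')
BQ_DATASET = 'ticks'

def export_day_to_csv(day):
    # Dump one day's rows from Cloud SQL into a local CSV file.
    # `day` is a datetime.date.
    conn = MySQLdb.connect(**MYSQL)
    cur = conn.cursor()
    cur.execute("SELECT * FROM tick_data WHERE DATE(ts) = %s", (day,))
    path = '/tmp/ticks_%s.csv' % day.strftime('%Y%m%d')
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(cur.fetchall())
    conn.close()
    return path

def load_csv_into_bq(day, path):
    # Load the CSV into a date-suffixed BigQuery table, e.g. ticks.tick_data_20191002.
    client = bigquery.Client()
    table_id = '%s.tick_data_%s' % (BQ_DATASET, day.strftime('%Y%m%d'))
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        write_disposition='WRITE_TRUNCATE')
    with open(path, 'rb') as f:
        client.load_table_from_file(f, table_id, job_config=job_config).result()
    return table_id

The verification step can then compare SELECT COUNT(*) (or a simple checksum) between the MySQL source and the loaded BigQuery table for that day, as in the other answers.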

Upvotes: 0

Graham Polley

Reputation: 14791

Here's what I would do:

  1. Export your table(s) in Cloud SQL to CSV
  2. Upload those CSV files to Google Cloud Storage
  3. Write a Cloud Dataflow pipeline to ingest the CSV files and write to the correct partitions (a sketch follows below)
  4. Use SQL in BigQuery to validate your data
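Here is a minimal sketch of step 3, assuming the Apache Beam Python SDK, a CSV layout of ts,symbol,price,volume, and placeholder bucket, project, and dataset names (adjust them to your own schema):

import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SCHEMA = 'ts:TIMESTAMP,symbol:STRING,price:FLOAT,volume:INTEGER'

def parse_csv_line(line):
    # Turn one CSV line into a BigQuery row dict.
    ts, symbol, price, volume = next(csv.reader([line]))
    return {'ts': ts, 'symbol': symbol, 'price': float(price), 'volume': int(volume)}

def run():
    # Pass --runner=DataflowRunner, --project, --region, --temp_location, etc. on the command line.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/ticks/*.csv', skip_header_lines=1)
         | 'Parse' >> beam.Map(parse_csv_line)
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'my_project:my_dataset.ticks',
               schema=SCHEMA,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               # Create the destination as a table partitioned by day on the ts column,
               # so each row lands in its date partition automatically.
               additional_bq_parameters={'timePartitioning': {'type': 'DAY', 'field': 'ts'}}))

if __name__ == '__main__':
    run()

For step 4, comparing per-day row counts and simple aggregates between the source export and the corresponding BigQuery partition is usually enough to catch truncated or corrupted loads.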

Upvotes: 1
