Tabber

Reputation: 171

Moving data from a database to Google BigQuery

I am trying to figure out a way to move data in batches (not streamed in) from an on-prem database (Oracle) to Google BigQuery.

I am trying to do this via Google Cloud Composer; however, I understand that I need to get the data into GCS first before I can use Composer to move it to BigQuery.

What is the best way to move data from an on-prem DB to GCS? Or is there a service in Google Cloud that allows me to move data directly from the DB to BigQuery?

Would this be better done with Cloud Dataflow instead? Thank you.

Upvotes: 1

Views: 3300

Answers (2)

Caio Belfort

Reputation: 555

You can schedule your data migrations to GCS outside of Cloud Composer. Then create a DAG with a GoogleCloudStorageObjectSensor, which polls GCS at intervals to check whether the file has arrived. Then you can copy your data to BigQuery with GCSToBigQueryOperator.
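A minimal DAG sketch of this pattern, assuming recent google-provider class names (the sensor is called GCSObjectExistenceSensor in newer provider releases) and hypothetical bucket, object, and table names:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="oracle_dump_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Wait for the exported file to land in GCS (bucket/object are placeholders)
    wait_for_dump = GCSObjectExistenceSensor(
        task_id="wait_for_dump",
        bucket="my-bucket",
        object="exports/customers.csv",
        poke_interval=300,  # poke GCS every 5 minutes
    )

    # Load the file into BigQuery once it exists
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",
        source_objects=["exports/customers.csv"],
        destination_project_dataset_table="my_project.my_dataset.customers",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    wait_for_dump >> load_to_bq
```

Exact import paths and operator names vary by Airflow/provider version, so check the docs for the release you are running.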

Upvotes: 2

Juan Urrego

Reputation: 353

I think you are mixing up the tools/services a little bit with the main goal, which is the migration. So before anything else, let me explain some concepts:

1) Cloud Composer is just a GCP product that provides Airflow as a service. And what is Airflow? An orchestration platform where you can programmatically author, schedule, and monitor workflows. Therefore, it is useful once you have multiple pipelines, jobs, or tasks to orchestrate. It could be used for the initial migration, but it is not a mandatory service.

2) Dataflow is a great product from GCP that uses Apache Beam as its main pipelining framework, so you can deploy batch or streaming pipelines. Dataflow is the alternative to classic products/frameworks such as Spark or Hadoop if you want a more cloud-native experience in GCP. The image below is Google's recommendation on when to use Dataproc (Spark, Hadoop, etc. as a service) versus Dataflow:

[Image: Google's decision flowchart for choosing between Dataproc and Dataflow]
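To make the Dataflow option a bit more concrete, here is a minimal batch Beam pipeline sketch that loads a hypothetical CSV dump from GCS into BigQuery; the bucket, project, columns, and table names are all made-up placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Assumes a simple two-column CSV: id,name (hypothetical schema)
    id_, name = line.split(",", 1)
    return {"id": int(id_), "name": name}

options = PipelineOptions(
    runner="DataflowRunner",        # use "DirectRunner" to test locally
    project="my-project",           # hypothetical project
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read CSV" >> beam.io.ReadFromText(
            "gs://my-bucket/dump/customers*.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(parse_line)
        | "Write to BQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.customers",
            schema="id:INTEGER,name:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```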

Now that we understand the terms, here comes the answer: it depends. These services can help you in the process, but each infrastructure is quite unique. For instance:

  • Can you stop services during the migration, or does it have to happen while the database is still accepting requests?
  • How long should it take? Is there a maximum time?
  • Is your DB behind firewalls? Could a VPN affect the process?
  • Is it a lift-and-shift approach? I mean, are you going to copy-paste directly, or is there going to be data refactoring to use BigQuery in the proper way?

So as you can see, it depends a bit on your needs and current infrastructure. However, I'm going to tell you what we did in our company:

In my case I took both approaches: lift-and-shift and data refactoring. The reason for doing that was to give the business something to query while we were busy refactoring the data architecture. The way we did the first one was moving the tables from our MSSQL database to BigQuery just as they were. The solution was a set of Python scripts that copied the schemas and recreated the tables on the other side (table by table, in small batches).
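Just to illustrate the per-table batch idea (a simplified sketch, not our actual scripts; it assumes pyodbc for MSSQL, the BigQuery Python client, and made-up connection and table names):

```python
import pyodbc
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")  # hypothetical project
mssql = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=dbhost;DATABASE=sales;UID=user;PWD=pass"  # hypothetical connection
)

cur = mssql.cursor()
cur.execute("SELECT * FROM dbo.customers")  # hypothetical table
columns = [col[0] for col in cur.description]

table_id = "my-project.my_dataset.customers"
while True:
    rows = cur.fetchmany(10_000)  # copy in small batches instead of one huge dump
    if not rows:
        break
    # Assumes JSON-serialisable values; dates/decimals would need converting first
    payload = [dict(zip(columns, row)) for row in rows]
    job = bq.load_table_from_json(payload, table_id)  # batch load job, not streaming
    job.result()
```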

Note: This process could be done using Dataflow, for instance; however, if your DB is behind firewalls and requires VPN access and so on, you will have a bad time trying to make it work.

The way suggested by Google is to dump your tables into GCS and then reconstruct them using the "create table from source" option in BigQuery (which behaves similarly to Athena in AWS). However, keep in mind that BigQuery only recognises the Avro, Parquet, JSON, CSV, and ORC formats, meaning your DB dump should be exported in one of them. If you use something like JSON or CSV, BigQuery is smart enough to auto-detect your schema; nevertheless, it's not perfect and it will make a lot of assumptions based on the first 100 rows. If you use Avro or Parquet, the schema is defined by those files, so you reduce the risk of false positives.
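For example, once a Parquet dump is sitting in GCS, a load job with the BigQuery Python client looks roughly like this (project, bucket, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # schema comes from the Parquet files
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/oracle_dump/customers/*.parquet",  # hypothetical dump location
    "my_dataset.customers",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```

For CSV or JSON you would switch source_format and set autodetect=True on the job config, with the caveats about schema guessing mentioned above.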

Note: In the case of Oracle, you can "export" to CSV or JSON using something like this: https://blogs.oracle.com/opal/fast-generation-of-csv-and-json-from-oracle-database
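Another option is a very small script with the python-oracledb driver; here is a sketch of dumping one table to CSV (connection details and table name are hypothetical):

```python
import csv
import oracledb

# Hypothetical credentials and DSN
conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")
cur = conn.cursor()
cur.execute("SELECT * FROM customers")  # hypothetical table

with open("customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row from column names
    for row in cur:
        writer.writerow(row)
```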

So in general there are plenty of options to tackle this issue and also a lot of things to consider. Here is a document from Google about migrating from Teradata to BigQuery, but in general it applies to any other kind of migration:

https://cloud.google.com/solutions/migration/td2bq/td-bq-migration-overview

Good luck!

Upvotes: 2
