Rutger Hofste

Reputation: 4373

Speed up PostgreSQL to BigQuery

I would like to upload some data that is currently stored in PostgreSQL to Google BigQuery to see how the two tools compare.

There are many options to move the data around, but the most user-friendly one (for me) that I have found so far leverages Python pandas.

# `engine` (a SQLAlchemy engine pointing at the RDS PostgreSQL instance) and
# `input_table_name` are defined earlier.
import pandas as pd

sql = "SELECT * FROM {}".format(input_table_name)

# Read the table in chunks and append each chunk to the BigQuery table.
for i, chunk in enumerate(pd.read_sql_query(sql, engine, chunksize=10000)):
    print("Chunk number:", i)
    chunk.to_gbq(destination_table="my_new_dataset.test_pandas",
                 project_id="aqueduct30",
                 if_exists="append")

However, this approach is rather slow, and I was wondering what options I have to speed things up. My table has 11 million rows and 100 columns.

The PostgreSQL database is on AWS RDS, and I call Python from an Amazon EC2 instance. Both are large and fast. I am currently not using multiple processors, although there are 16 available.

Upvotes: 0

Views: 1452

Answers (1)

Graham Polley

Reputation: 14791

As alluded to by the comment from JosMac, your solution/approach simply won't scale with large datasets. Since you're already running on AWS/RDS, something like the following would be better in my opinion:

  1. Export Postgres table(s) to S3
  2. Use the GCS transfer service to pull export from S3 into GCS
  3. Load directly into BigQuery from GCS, as sketched below (consider automating this pipeline using Cloud Functions and Dataflow)
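A minimal sketch of step 3 using the google-cloud-bigquery client, assuming the export landed in GCS as CSV shards and targeting the dataset, table and project from the question; the bucket and path below are hypothetical placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="aqueduct30")

# Configure a load job that appends CSV data and lets BigQuery infer the schema.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1   # skip the CSV header row
job_config.autodetect = True
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

# A wildcard URI picks up all exported CSV shards in a single load job.
load_job = client.load_table_from_uri(
    "gs://my-export-bucket/postgres_export/*.csv",  # hypothetical GCS path
    "aqueduct30.my_new_dataset.test_pandas",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes

Because the load job reads the files straight out of GCS on the BigQuery side, you avoid pushing the 11 million rows through pandas on the EC2 instance altogether.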

Upvotes: 2
