Reputation: 4373
I would like to upload some data that is currently stored in PostgreSQL to Google BigQuery to see how the two tools compare.
There are many options for moving data around, but the most user-friendly one (for me) that I have found so far leverages the power of Python pandas.
sql = "SELECT * FROM {}".format(input_table_name)
i = 0
for chunk in pd.read_sql_query(sql , engine, chunksize=10000):
print("Chunk number: ",i)
i += 1
df.to_gbq(destination_table="my_new_dataset.test_pandas",
project_id = "aqueduct30",
if_exists= "append" )
However, this approach is rather slow, and I was wondering what options I have to speed things up. My table has 11 million rows and 100 columns.
The PostgreSQL database is on AWS RDS, and I call Python from an Amazon EC2 instance. Both are large and fast. I am currently not using multiple processes, although 16 cores are available; a rough sketch of the parallel variant I have in mind is shown below.
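For reference, this is the kind of parallel approach I have been considering. It is untested and full of assumptions: the slice size, the worker count, the presence of an indexed id column to order by, and the placeholder values for POSTGRES_URI and input_table_name are all illustrative, not my real setup.

from multiprocessing import Pool

import pandas as pd
from sqlalchemy import create_engine

POSTGRES_URI = "postgresql://user:password@my-rds-host:5432/mydb"  # placeholder connection string
input_table_name = "my_table"                                      # placeholder source table

ROWS_PER_SLICE = 500000   # illustrative slice size
N_WORKERS = 16            # one worker per available core

def upload_slice(offset):
    # Each worker opens its own connection, reads one slice, and appends it to BigQuery.
    # Assumes an indexed id column for a stable ORDER BY, and that the destination
    # table already exists (so concurrent appends don't race on table creation).
    engine = create_engine(POSTGRES_URI)
    sql = "SELECT * FROM {} ORDER BY id LIMIT {} OFFSET {}".format(
        input_table_name, ROWS_PER_SLICE, offset)
    df = pd.read_sql_query(sql, engine)
    df.to_gbq(destination_table="my_new_dataset.test_pandas",
              project_id="aqueduct30",
              if_exists="append")

if __name__ == "__main__":
    offsets = range(0, 11000000, ROWS_PER_SLICE)
    with Pool(N_WORKERS) as pool:
        pool.map(upload_slice, offsets)

The idea is simply to keep all 16 cores busy with independent slices instead of streaming one 10,000-row chunk at a time.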
Upvotes: 0
Views: 1452
Reputation: 14791
As alluded to in the comment from JosMac, your solution/approach simply won't scale with large datasets. Since you're already running on AWS/RDS, something like the following would be better in my opinion:
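The gist is to lean on bulk export and bulk load instead of streaming chunks through pandas. Below is a sketch of one such pipeline; the bucket name, local file path, table name, and connection string are placeholders, and the psycopg2 / google-cloud-storage / google-cloud-bigquery calls show one common pattern rather than the only way to do it.

# Sketch only: "my-staging-bucket", "/tmp/export.csv", "my_table" and the DSN
# below are placeholders; adapt them to your RDS instance and GCP project.
import psycopg2
from google.cloud import bigquery, storage

POSTGRES_DSN = "host=my-rds-host dbname=mydb user=me password=secret"  # placeholder

# 1) Bulk-export the table from RDS to a local CSV with COPY
#    (server-side, much faster than pulling rows through pandas).
conn = psycopg2.connect(POSTGRES_DSN)
with conn.cursor() as cur, open("/tmp/export.csv", "w") as f:
    cur.copy_expert("COPY my_table TO STDOUT WITH CSV HEADER", f)
conn.close()

# 2) Stage the CSV in Google Cloud Storage.
gcs = storage.Client(project="aqueduct30")
gcs.bucket("my-staging-bucket").blob("export.csv").upload_from_filename("/tmp/export.csv")

# 3) Run a single BigQuery load job from the staged file.
bq = bigquery.Client(project="aqueduct30")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
job = bq.load_table_from_uri(
    "gs://my-staging-bucket/export.csv",
    "my_new_dataset.test_pandas",
    job_config=job_config,
)
job.result()  # wait for the load to complete

This moves the heavy lifting to PostgreSQL's COPY and BigQuery's native load path, so the 11-million-row table goes over in one load job instead of roughly 1,100 separate 10,000-row appends.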
Upvotes: 2