Reputation: 11577
What's the easiest way to load a large csv file into a Postgres RDS database in AWS using Python?
To transfer data to a local postgres instance, I have previously used a psycopg2
connection to run SQL statements like:
COPY my_table FROM 'my_10gb_file.csv' DELIMITER ',' CSV HEADER;
However, when executing this against a remote AWS RDS database, it generates an error because the .csv file is on my local machine rather than on the database server:
ERROR: must be superuser to COPY to or from a file
SQL state: 42501
Hint: Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.
This answer explains why this doesn't work.
I'm now looking for the Python syntax to do this using psql. I have a large number of .csv files to upload, so I need a script to automate the process.
Upvotes: 1
Views: 2114
Reputation: 11577
First, you need to create the table definitions in the RDS Postgres database as usual, using CREATE TABLE SQL statements.
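Since you already have psycopg2, you can run those CREATE TABLE statements from Python too. A minimal sketch, assuming placeholder connection details and a made-up column layout that you would replace with your own schema:

import psycopg2

# Hypothetical connection details -- replace with your own RDS endpoint and credentials
conn = psycopg2.connect(host="YOUR_HOST", dbname="YOUR_DBNAME",
                        user="YOUR_USERNAME", password="YOUR_PASSWORD")

with conn, conn.cursor() as cur:
    # Placeholder columns -- adjust to match the layout of your CSV files
    cur.execute("""
        CREATE TABLE IF NOT EXISTS my_table (
            id integer,
            name text,
            value numeric
        )
    """)
conn.close()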
Then you need to run a psql command like this:
psql -p 5432 --host YOUR_HOST --username YOUR_USERNAME --dbname YOUR_DBNAME --command "\copy my_table FROM 'my_10gb_file.csv' DELIMITER ',' CSV HEADER"
In Python, we can set this up and execute it as follows:
host = "YOUR_HOST"
username = "YOUR_USERNAME"
dbname = "YOUR_DBNAME"
table_name = "my_table"
file_name = "my_10gb_file.csv"
command = "\copy {} FROM '{}' DELIMITER ',' CSV HEADER".format(table_name, file_name)
psql_template = 'psql -p 5432 --host {} --username {} --dbname {} --command "{}"'
bash_command = psql_template.format(host, username, dbname, command.strip())
process = subprocess.Popen(bash_command, stdout=subprocess.PIPE, shell=True)
output, error = process.communicate()
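Since you have many .csv files, you can wrap this in a loop. A minimal sketch, assuming each file is named after its target table and sits in the current directory, and that the password is supplied via a ~/.pgpass file or the PGPASSWORD environment variable so psql does not prompt interactively:

import glob
import os
import subprocess

host = "YOUR_HOST"
username = "YOUR_USERNAME"
dbname = "YOUR_DBNAME"

# Assumption: each CSV is named <table_name>.csv
for file_name in glob.glob("*.csv"):
    table_name = os.path.splitext(os.path.basename(file_name))[0]
    command = r"\copy {} FROM '{}' DELIMITER ',' CSV HEADER".format(table_name, file_name)
    bash_command = 'psql -p 5432 --host {} --username {} --dbname {} --command "{}"'.format(
        host, username, dbname, command)
    process = subprocess.Popen(bash_command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, shell=True)
    output, error = process.communicate()
    if process.returncode != 0:
        # Surface psql errors instead of silently continuing with the next file
        print("Failed to load {}: {}".format(file_name, error.decode()))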
Upvotes: 1