Reputation: 535
This answer suggests using AWS Data Pipeline but I'm wondering if there's a clever way to do it with my own machine and Python.
I've been using the psycopg2, boto3 and pandas libraries. Tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.
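For context, the kind of thing I've been doing is roughly the following; the connection details, table and bucket names are placeholders, and this is a simplified sketch rather than my exact code:
import io

import boto3
import pandas as pd
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(dbname='mydb', user='myuser', password='mypw', host='myhost')

# Pull the whole table into a DataFrame, serialize it to CSV in memory,
# then upload the CSV to S3. With a few million rows this gets slow and
# memory-hungry, which is the problem.
df = pd.read_sql_query('select * from my_schema.example', conn)

csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)

boto3.resource('s3').Object('example-bucket', 'example.csv').put(Body=csv_buffer.getvalue())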
Upvotes: 4
Views: 13334
Reputation: 788
This is an old question, but it comes up when searching for "aws_s3.export_query_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.
This can be done natively with a Postgres extension if you're using AWS Aurora Postgres 11.6 or above, via aws_s3.query_export_to_s3.
Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3 https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
See here for the function reference: https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions
This has been available in Aurora for Postgres since 3.1.0, which was released on February 11, 2020 (I don't know why this URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310
I would not recommend using 3.1.0/11.6, however, because there is a bug that causes data corruption issues once more than 10MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494
I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so anyone who wants a way to dump data from Postgres to S3 and is on AWS, give this a try!
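Note that the aws_s3 extension needs to be installed in the database first, and the cluster needs an IAM role that allows writing to the target bucket (the S3 export doc linked above covers that part). The install itself is just:
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;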
Here's an example query to create JSONL for you. JSONL is JSON with a single JSON object per line: https://jsonlines.org/
So you can dump a whole table to a JSONL file, for example... You could also do json_agg in postgres and dump it as a single JSON file with the objects in an array (see the variant right after the JSONL example below); it's up to you, really. Just change the query and the file extension, and leave it as text format.
select * from aws_s3.query_export_to_s3(
'select row_to_json(data) from (<YOUR QUERY HERE>) data',
aws_commons.create_s3_uri(
'example-bucket/some/path',
'whatever.jsonl',
'us-east-1'),
options :='format text');
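And for the json_agg variant mentioned above, only the inner query and the file extension change; roughly something like this (placeholders as before):
select * from aws_s3.query_export_to_s3(
'select json_agg(data) from (<YOUR QUERY HERE>) data',
aws_commons.create_s3_uri(
'example-bucket/some/path',
'whatever.json',
'us-east-1'),
options :='format text');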
For CSV, something like this should do the trick:
select * from aws_s3.query_export_to_s3(
'<YOUR QUERY HERE>',
aws_commons.create_s3_uri(
'example-bucket/some/path',
'whatever.csv',
'us-east-1'),
options :='format csv');
Upvotes: 3
Reputation: 535
Guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2:
import io
import psycopg2
import boto3

resource = boto3.resource('s3')
# db, user, pw and host are placeholders / defined elsewhere.
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # COPY (<query>) TO STDOUT streams the query result straight into an
    # in-memory buffer via copy_expert; the buffer is then uploaded to S3.
    query = f"""COPY ({select_query}) TO STDOUT \
    WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
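Calling it then looks something like this (bucket, schema and table names are placeholders):
copyFun('example-bucket', 'select * from my_schema.example', 'example')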
Upvotes: 10
Reputation: 5739
We do the following in our case. Performance-wise it's pretty fast, and it's a scheduled method rather than continuous streaming. I'm not 100% sure if it's the wisest method, but it's definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into the data warehouse.
Using a shell script, we fire a psql command to copy the data to a local file on the EC2 app instance:
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv
Then, using a shell script, we fire an s3cmd command to put example.csv to the designated S3 bucket location:
s3cmd put example.csv s3://your-bucket/path/to/file/
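Put together as one scheduled script (kicked off from cron, for example), it's roughly the following; connection options, query, bucket and paths are placeholders:
#!/bin/bash
set -e

# Export the query result to a local CSV file on the EC2 app instance...
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' > example.csv

# ...then put it to the designated S3 bucket location.
s3cmd put example.csv s3://your-bucket/path/to/file/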
Upvotes: 2