Tsingis

Reputation: 535

A way to export psql table (or query) directly to AWS S3 as file (csv, json)

This answer suggests using AWS Data Pipeline, but I'm wondering if there's a clever way to do it with my own machine and Python.

I've been using the psycopg2, boto3 and pandas libraries. Tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.
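
For illustration, a minimal sketch of the kind of pandas-based flow I mean; the connection details, table and bucket names are placeholders. Everything is pulled into memory, which is where it struggles with a few million rows.

import io

import boto3
import pandas as pd
import psycopg2

# Placeholder connection details and object names.
conn = psycopg2.connect(dbname='mydb', user='user', password='pw', host='host')

# Read the whole result set into a DataFrame, serialize it as CSV in memory,
# then upload the string to S3. The in-memory round trip is the bottleneck.
df = pd.read_sql_query('SELECT * FROM my_schema.example', conn)
buf = io.StringIO()
df.to_csv(buf, index=False)

boto3.resource('s3').Object('my-bucket', 'exports/example.csv').put(Body=buf.getvalue())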

Upvotes: 4

Views: 13334

Answers (3)

Gordo

Reputation: 788

This is an old question, but it comes up when searching for "aws_s3.export_query_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.

This can be done natively with a Postgres extension if you're using AWS Aurora PostgreSQL 11.6 or above, via aws_s3.query_export_to_s3.

Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3 https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html

See here for the function reference: https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions

This has been available in Aurora PostgreSQL since 3.1.0, which was released on February 11, 2020 (I don't know why this URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310

I would not recommend using 3.1.0/11.6, however, as there is a bug that causes data corruption issues after 10MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494

I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so anyone who wants a way to dump data from Postgres to S3 and is on AWS should give this a try!

Here's an example query to create JSONL for you.

JSONL is JSON, with a single JSON object per line: https://jsonlines.org/

So you can dump a whole table to a JSONL file, for example. You could also use json_agg in Postgres and dump a single JSON file with the objects in an array; it's up to you, really. Just change the query and the file extension, and leave the format as text.

select * from aws_s3.query_export_to_s3(
    'select row_to_json(data) from (<YOUR QUERY HERE>) data',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.jsonl',
        'us-east-1'),
    options :='format text');

For CSV, something like this should do the trick:

select * from aws_s3.query_export_to_s3(
    '<YOUR QUERY HERE>',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.csv',
        'us-east-1'),
    options :='format csv');
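
If you'd rather trigger the export from Python than a SQL console, a minimal sketch with psycopg2 might look like the following; the connection details, query and bucket are placeholders, and the cluster still needs the aws_s3 extension installed plus an IAM role that is allowed to write to the bucket.

import psycopg2

# Placeholder Aurora connection details.
conn = psycopg2.connect(dbname='mydb', user='user', password='pw', host='aurora-host')

export_sql = """
    SELECT * FROM aws_s3.query_export_to_s3(
        'SELECT * FROM my_schema.example',
        aws_commons.create_s3_uri('example-bucket/some/path', 'whatever.csv', 'us-east-1'),
        options := 'format csv')
"""

with conn, conn.cursor() as cur:
    cur.execute(export_sql)
    # The function reports how much it wrote to S3.
    rows_uploaded, files_uploaded, bytes_uploaded = cur.fetchone()
    print(rows_uploaded, files_uploaded, bytes_uploaded)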

Upvotes: 3

Tsingis
Tsingis

Reputation: 535

Guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2.

import io
import psycopg2
import boto3

# Connection parameters (db, user, pw, host) are defined elsewhere.
resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # select_query is either a table name or a parenthesized SELECT,
    # which is what COPY ... TO STDOUT expects.
    query = f"""COPY {select_query} TO STDOUT \
            WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    # Upload the in-memory CSV to S3 as <filename>.csv.
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
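
For what it's worth, a call would look something like this; the bucket and query are placeholders, and the SELECT is wrapped in parentheses because COPY expects either a table name or a parenthesized query.

# Hypothetical usage with placeholder names.
copyFun('my-bucket', '(SELECT * FROM my_schema.example)', 'example')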

Upvotes: 10

Red Boy

Reputation: 5739

We do the following in our case. Performance-wise it's pretty fast, and it's a scheduled method rather than continuous streaming. I'm not 100% sure if it's the wisest method, but it's definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into the data warehouse; a rough Python equivalent of the same two steps is sketched below the list.

  • Using a shell script, we fire a psql command to copy data to a local file on the EC2 app instance.

    psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv

  • Then, using a shell script, we fire an s3cmd command to put example.csv to the designated S3 bucket location.

    s3cmd put example.csv s3://your-bucket/path/to/file/
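
If you'd rather drive those two steps from Python instead of a shell script, a rough sketch could look like this; connection options, table, paths and bucket are placeholders, and it assumes psql can authenticate via .pgpass or PGPASSWORD.

import subprocess

# Step 1: dump the query result to a local CSV file with psql.
subprocess.run(
    "psql -h myhost -U myuser -d mydb -F, -A "
    "-c 'select * from my_schema.example' > /tmp/example.csv",
    shell=True, check=True)

# Step 2: push the file to the designated S3 location with s3cmd.
subprocess.run(
    ['s3cmd', 'put', '/tmp/example.csv', 's3://your-bucket/path/to/file/'],
    check=True)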

Upvotes: 2
