rosstripi

Reputation: 584

Write pandas DataFrame as compressed CSV directly to Amazon S3 bucket?

I currently have a script that reads the existing version of a CSV saved to S3, combines it with the new rows in a pandas DataFrame, and then writes the result directly back to S3.

    try:
        csv_prev_content = str(s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read(), 'utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        # No existing CSV yet, so start from an empty string.
        csv_prev_content = ''

    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    s3_resource.Object('bucket-name', ticker_csv_file_name).put(Body=csv_output)

Is there a way to do this with a gzip-compressed CSV? I want to read an existing .gz-compressed CSV from S3 if one exists, concatenate it with the contents of the DataFrame, and then overwrite the .gz with the new combined, compressed CSV directly in S3, without having to make a local copy.
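Here is a rough, untested sketch of what I imagine the read side would look like (the gzip decompression is the part I'm unsure about; `ticker_csv_file_name` is the same key as above):

    import gzip
    from io import BytesIO

    # Untested sketch: download the existing .gz object and decompress it in memory.
    try:
        gz_body = s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read()
        with gzip.GzipFile(fileobj=BytesIO(gz_body)) as gz:
            csv_prev_content = gz.read().decode('utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        csv_prev_content = ''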

Upvotes: 8

Views: 14042

Answers (3)

There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/):

import pandas as pd
from smart_open import open

# smart_open infers gzip compression from the .gz extension, and
# closing the file finalizes the upload to S3.
with open('s3://bucket/prefix/filename.csv.gz', 'w') as f:
    df.to_csv(f, index=False)
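For the read half of the question, the same call works in reverse. A minimal sketch, assuming the same hypothetical bucket and key as above (smart_open decompresses transparently based on the .gz suffix):

import pandas as pd
from smart_open import open

# Decompression is inferred from the .gz extension, just as on write.
with open('s3://bucket/prefix/filename.csv.gz', 'r') as f:
    prev_df = pd.read_csv(f)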

Upvotes: 2

user582175

Reputation: 1244

If you want streaming writes (so that neither the compressed nor the decompressed CSV is ever held in memory in full), you can do this:

import gzip
import io

import s3fs

def write_df_to_s3(df, filename, path):
    # Stream the DataFrame to S3 as a gzip-compressed CSV; the gzip stream
    # writes straight into the open S3 file instead of an in-memory buffer.
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz)
        df.to_csv(buf, index=False, encoding='utf-8')
        buf.flush()  # flush the text wrapper before finalizing the gzip stream
        gz.close()

TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827
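Example call, with hypothetical bucket and key names; note that the `filename` argument only sets the original-file name recorded in the gzip header:

import pandas as pd

df = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'price': [150.0, 300.0]})
write_df_to_s3(df, 'data.csv', 'my-bucket/data.csv.gz')  # hypothetical bucket/key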

Upvotes: -1

ramhiser

Reputation: 3540

Here's a solution in Python 3.5.2 using Pandas 0.20.1.

The source DataFrame can be read from S3, a local CSV, or anywhere else.

import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# Compress the CSV into an in-memory buffer.
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# Upload the compressed bytes to S3.
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
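To complete the read-concatenate-overwrite cycle from the question, the existing compressed object can be read back the same way. A sketch, reusing the bucket and key above (catching `NoSuchKey` handles the first run, when no previous file exists):

import boto3
import gzip
import pandas as pd
from io import BytesIO

s3_resource = boto3.resource('s3')

# Download the compressed object and decompress it in memory.
try:
    gz_body = s3_resource.Object('ramey', 'new-file.csv.gz').get()['Body'].read()
    with gzip.GzipFile(fileobj=BytesIO(gz_body)) as gz_file:
        prev_df = pd.read_csv(gz_file)
except s3_resource.meta.client.exceptions.NoSuchKey:
    prev_df = pd.DataFrame()  # no previous version yet

combined_df = pd.concat([prev_df, df])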

Upvotes: 19
