h0r53

Reputation: 3229

Write Pandas DataFrame To S3 as Pickle

My requirement is to write a Pandas DataFrame to S3 as a pickle file, building it in an in-memory buffer rather than writing it to local disk first.

I created the following simple function that uploads a Pandas DataFrame to S3 as a CSV:

import io
import boto3

# s3bucket and s3_upload_path are strings defined elsewhere at module level
def df_to_s3_csv(df, filename, sep=','):
    s3 = boto3.resource('s3')
    buffer = io.StringIO()  # CSV is text, so a StringIO buffer works here
    df.to_csv(buffer, sep=sep, index=False)
    s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=buffer.getvalue())
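
For reference, a call looks roughly like this; the bucket name, key prefix, and data below are placeholders standing in for my actual values:

import pandas as pd

# Hypothetical configuration and data; the real values differ
s3bucket = 'my-bucket'
s3_upload_path = 'uploads/dataframes'
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_to_s3_csv(df, 'data.csv')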

This function works fine and does what it is supposed to do. For the pickle file, I created the following function in a similar manner:

def df_to_s3_pckl(df, filename):
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()  # pickle output is binary, so a BytesIO is needed
    df.to_pickle(buffer)   # this call appears to close the buffer
    buffer.seek(0)
    obj = s3.Object(s3bucket, f'{s3_upload_path}/{filename}')
    obj.put(Body=buffer.getvalue())

I tried this function both with and without the seek call, and either way it throws the following error: ValueError: I/O operation on closed file.

Looking further into the issue, I found that the buffer is considered closed as soon as df.to_pickle is called. This is reproducible by issuing these commands:

buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)

The above prints True. It appears that the BytesIO buffer is closed by to_pickle and therefore its data cannot be referenced. How can this issue be resolved, or is there an alternative that meets my requirements? I've found several questions on SO about how to upload to S3 using boto3, but nothing regarding how to upload pickle files created by Pandas using BytesIO buffers.

Here is a minimal reproducible example of the underlying issue:

import pandas as pd
import numpy as np
import io
df = pd.DataFrame(np.random.randint(0, 100, size=(4, 4)))
buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)

Upvotes: 3

Views: 2854

Answers (1)

h0r53

Reputation: 3229

It appears that the issue can be traced to the pandas source code, which closes the file handle it is given. This may ultimately be a bug in pandas revealed by unanticipated usage of a BytesIO object in the to_pickle method. I managed to circumvent the issue in the minimal reproducible example with the following code, which uses the dump function from the pickle module directly:

import pandas as pd
import numpy as np
import io
from pickle import dump
df = pd.DataFrame(np.random.randint(0, 100, size=(4, 4)))
buffer = io.BytesIO()
dump(df, buffer)  # pickle.dump leaves the caller's buffer open
buffer.seek(0)    # rewind so the stream can be read from the start
print(buffer.closed)

Now the print statement prints False and the BytesIO stream data can be accessed.
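
Applying the same workaround to the upload function from the question gives a sketch like the following. It assumes, as in the question, that s3bucket and s3_upload_path are defined elsewhere, and simply substitutes pickle.dump for df.to_pickle:

import io
import boto3
from pickle import dump

# Sketch: the question's upload function with pickle.dump substituted for
# df.to_pickle. s3bucket and s3_upload_path are assumed to be defined
# elsewhere, as in the question.
def df_to_s3_pckl(df, filename):
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    dump(df, buffer)  # serializes the DataFrame without closing the buffer
    buffer.seek(0)    # rewind; required if the buffer object itself were handed to boto3
    s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=buffer.getvalue())

An object written this way should be readable again with pd.read_pickle, since pickle.dump produces an ordinary pickle of the DataFrame.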

Upvotes: 2
