Reputation: 3229
Here are my requirements.
I created the following simple function that uploads a Pandas dataframe to s3 as a csv:
def df_to_s3_csv(df, filename, sep=','):
    s3 = boto3.resource('s3')
    buffer = io.StringIO()
    df.to_csv(buffer, sep=sep, index=False)
    s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=buffer.getvalue())
This function works fine and does what it is supposed to. For the pickle file, I created the following function in a similar manner:
def df_to_s3_pckl(df, filename):
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    df.to_pickle(buffer)
    buffer.seek(0)
    obj = s3.Object(s3bucket, f'{s3_upload_path}/{filename}')
    obj.put(Body=buffer.getvalue())
I tried this function with and without the buffer.seek(0) call, and either way it throws the following error: ValueError: I/O operation on closed file.
Looking further into the issue, I found that buffer is considered closed as soon as df.to_pickle is called. This is reproducible by issuing these commands:
buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)
The above prints True. It appears that the BytesIO buffer is closed by to_pickle and therefore its data cannot be referenced. How can this issue be resolved, or is there an alternative that meets my requirements? I've found several questions on SO about how to upload to S3 using boto3, but nothing regarding how to upload pickle files created by Pandas using BytesIO buffers.
Here is a minimal reproducible example of the underlying issue:
import pandas as pd
import numpy as np
import io
df = pd.DataFrame(np.random.randint(0,100,size=(4,4)))
buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)
Upvotes: 3
Views: 2854
Reputation: 3229
It appears that the issue can be traced to the pandas source code. This may ultimately be a bug in pandas, revealed by passing a BytesIO object to the to_pickle method, a usage that may not have been anticipated. I managed to circumvent the issue in the minimal reproducible example with the following code, which uses the dump function from the pickle module instead:
import pandas as pd
import numpy as np
import io
from pickle import dump
df = pd.DataFrame(np.random.randint(0,100,size=(4,4)))
buffer = io.BytesIO()
dump(df, buffer)
buffer.seek(0)
print(buffer.closed)
Now the print statement prints False and the BytesIO stream data can be accessed.
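Applied to the upload function from the question, the same workaround might look like the sketch below. This is only an illustration under some assumptions: the helper name df_to_pickle_bytes is made up here, and the S3 resource, bucket name, and upload path are passed in as parameters (s3, bucket, upload_path) rather than read from the globals used in the question. The s3 argument is expected to be what boto3.resource('s3') returns.

```python
import io
import pickle


def df_to_pickle_bytes(df):
    """Pickle a DataFrame (or any picklable object) into bytes, in memory.

    Hypothetical helper: uses pickle.dump instead of df.to_pickle so the
    BytesIO buffer is still open when its contents are read.
    """
    buffer = io.BytesIO()
    pickle.dump(df, buffer, protocol=pickle.HIGHEST_PROTOCOL)
    return buffer.getvalue()


def df_to_s3_pckl(df, filename, s3, bucket, upload_path):
    """Upload the pickled object to S3.

    Assumption: s3 is the object returned by boto3.resource('s3');
    bucket and upload_path stand in for the question's globals.
    """
    body = df_to_pickle_bytes(df)
    s3.Object(bucket, f'{upload_path}/{filename}').put(Body=body)
```

A call would then be along the lines of df_to_s3_pckl(df, 'data.pkl', boto3.resource('s3'), s3bucket, s3_upload_path). Since getvalue() returns the whole buffer regardless of the current position, the seek(0) is not needed in this variant.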
Upvotes: 2