user866364

Reputation:

Write pandas DataFrame to S3 as partitioned Parquet

How can I write Parquet partitioned by a column to S3? This is what I'm trying:

from io import BytesIO

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def write_df_into_s3(df, bucket_name, filepath, format="parquet"):
    buffer = None
    hook = S3Hook()

    if format == "parquet":
        buffer = BytesIO()
        df.to_parquet(buffer, index=False, partition_cols=['date'])
    else:
        raise Exception("Format not implemented!")

    hook.load_bytes(buffer.getvalue(), filepath, bucket_name)

    return f"s3://{bucket_name}/{filepath}"

But I get this error: 'NoneType' object has no attribute '_isfilestore'.
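
The error most likely comes from pyarrow: with partition_cols, to_parquet needs a directory path (it writes one file per partition value), so an in-memory BytesIO buffer cannot be resolved to a filesystem. A minimal sketch of the same function writing straight to an s3:// prefix instead, assuming the s3fs package is installed and AWS credentials are configured:

import pandas as pd

def write_df_into_s3(df, bucket_name, filepath, format="parquet"):
    if format != "parquet":
        raise Exception("Format not implemented!")
    path = f"s3://{bucket_name}/{filepath}"
    # pandas/pyarrow create a directory layout like .../date=2021-01-01/<part>.parquet
    df.to_parquet(path, index=False, partition_cols=["date"])
    return path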

Upvotes: 2

Views: 6006

Answers (1)

Vincent Claes

Reputation: 4788

For Python 3.6+, AWS has a library called aws-data-wrangler (imported as awswrangler) that helps with the integration between pandas, S3, and Parquet.

To install it, run:

pip install awswrangler

If you want to write your pandas DataFrame as a partitioned Parquet dataset to S3, do:

import awswrangler as wr

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/",
    dataset=True,
    partition_cols=["date"],
)
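
To check the result, the same partitioned dataset can be read back with awswrangler; a short usage sketch, assuming the same placeholder path:

import awswrangler as wr

# dataset=True reads every file under the prefix and restores the 'date'
# partition column from the directory names
df_back = wr.s3.read_parquet(path="s3://my-bucket/key/", dataset=True)
print(df_back.head())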

Upvotes: 2
