oya163

Reputation: 1519

How to read a single parquet file in S3 into pandas dataframe using boto3?

I am trying to read a single Parquet file stored in an S3 bucket and convert it into a pandas DataFrame using boto3.

Upvotes: 5

Views: 20756

Answers (5)

felipeformenti

Reputation: 177

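import pandas as pd

# full_s3_path is an "s3://bucket/key" URI; pandas needs s3fs installed to read it
df = pd.read_parquet(
    full_s3_path,
    storage_options=dict(profile="<your_profile_name>")
)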

Upvotes: 0

Vincent Claes

Reputation: 4768

For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between pandas, S3, and Parquet.

To install it:

pip install awswrangler

To read a single Parquet file from S3 using awswrangler 1.x.x and above:

import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")

Upvotes: 7

James O'Brien

Reputation: 1706

Maybe simpler:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
df = pq.read_table('s3://blah/blah.parquet', filesystem=s3).to_pandas()

Upvotes: 0

Andrew

Reputation: 41

There is info on using PyArrow to read a Parquet file from an S3 bucket into a Pandas dataframe here: https://arrow.apache.org/docs/python/parquet.html

import pyarrow.parquet as pq
import s3fs

dataset = pq.ParquetDataset(
    's3://<s3_path_to_folder_or_file>',
    filesystem=s3fs.S3FileSystem(),
    filters=[('colA', '=', 'some_value'), ('colB', '>=', some_number)],
)
table = dataset.read()
df = table.to_pandas()

I prefer this way of reading Parquet from S3 because it encourages the use of Parquet partitions through the filters parameter. Note, however, that there is a bug affecting this approach: https://issues.apache.org/jira/browse/ARROW-2038
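To make the filter semantics concrete: a flat list of tuples passed to filters is treated as a conjunction, so a row is kept only if every condition holds. Below is a minimal pure-Python sketch of that logic on plain dictionaries. It is not how PyArrow implements filtering; `row_matches`, `_OPS`, and the sample rows are hypothetical names made up for illustration, and only the comparison operators are covered.

```python
import operator

# Map the comparison strings used in filter tuples to Python operators.
# This covers only a subset of what PyArrow accepts (no "in" / "not in").
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def row_matches(row, filters):
    """Return True if the row satisfies every (column, op, value) tuple."""
    return all(_OPS[op](row[col], value) for col, op, value in filters)


# Hypothetical sample data standing in for rows of a Parquet dataset.
rows = [
    {"colA": "some_value", "colB": 10},
    {"colA": "some_value", "colB": 3},
    {"colA": "other", "colB": 10},
]
filters = [("colA", "=", "some_value"), ("colB", ">=", 5)]

# Keep only the rows for which all conditions hold.
kept = [r for r in rows if row_matches(r, filters)]
print(kept)
```

With real data, the benefit of passing the same tuples to PyArrow is that conditions on partition columns can let it skip whole files instead of reading them and filtering afterwards.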

Upvotes: 4

oya163

Reputation: 1519

Found a way to simply read a Parquet file into a DataFrame using the boto3 package.

import boto3
import io
import pandas as pd

# Download the Parquet object into an in-memory buffer
buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3_object = s3.Object('my-bucket-name', 'path/to/parquet/file')
s3_object.download_fileobj(buffer)

# Rewind the buffer, then parse it as Parquet
buffer.seek(0)
df = pd.read_parquet(buffer)

print(df.head())
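Note that s3.Object takes the bucket name and the object key as two separate arguments, whereas the other answers use a single s3:// path. If you only have the s3:// form, a small helper can split it. This is a sketch using only the standard library; `split_s3_uri` is a hypothetical name, not part of boto3.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {uri!r}")
    return bucket, key


bucket, key = split_s3_uri("s3://my-bucket-name/path/to/parquet/file")
print(bucket, key)
```

The resulting pair can then be passed on as s3.Object(bucket, key).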

Upvotes: 5
