Reputation: 1519
I am trying to read a single parquet file stored in an S3 bucket and convert it into a pandas dataframe using boto3.
Upvotes: 5
Views: 20756
Reputation: 177
import pandas as pd

# Reads directly from S3 using the named AWS profile from ~/.aws/credentials
df = pd.read_parquet(
    full_s3_path,  # e.g. "s3://my-bucket/path/to/file.parquet"
    storage_options=dict(profile="<your_profile_name>")
)
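Note that storage_options is forwarded to the underlying fsspec filesystem, so s3fs needs to be installed alongside pandas (the parameter itself requires pandas >= 1.2). If you'd rather pass credentials explicitly than rely on a named profile, s3fs also accepts key/secret; a minimal sketch, with placeholder values:

import pandas as pd

# Sketch: explicit credentials instead of a named profile (values are placeholders)
df = pd.read_parquet(
    "s3://my-bucket/path/to/file.parquet",
    storage_options={"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"},
)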
Upvotes: 0
Reputation: 4768
For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas, S3, and Parquet.

To install:

pip install awswrangler

To read a single parquet file from S3 using awswrangler 1.x.x and above:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")
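The same call also works on a folder of parquet files; a sketch using the dataset flag, with a placeholder prefix:

import awswrangler as wr

# Reads every parquet file under the prefix into a single dataframe
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)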
Upvotes: 7
Reputation: 1706
Maybe simpler:
import pyarrow.parquet as pq
import s3fs

# s3fs picks up credentials from the standard AWS credential chain automatically
s3 = s3fs.S3FileSystem()
df = pq.read_table('s3://blah/blah.parquet', filesystem=s3).to_pandas()
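Since parquet is columnar, read_table can also push down a column selection so only the needed column chunks are read; a minimal sketch with made-up column names:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
# Only the listed columns are read (names are placeholders)
df = pq.read_table('s3://blah/blah.parquet', columns=['col_a', 'col_b'], filesystem=s3).to_pandas()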
Upvotes: 0
Reputation: 41
There is info on using PyArrow to read a Parquet file from an S3 bucket into a Pandas dataframe here: https://arrow.apache.org/docs/python/parquet.html
import pyarrow.parquet as pq
import s3fs

dataset = pq.ParquetDataset(
    's3://<s3_path_to_folder_or_file>',
    filesystem=s3fs.S3FileSystem(),
    filters=[('colA', '=', 'some_value'), ('colB', '>=', some_number)]
)
table = dataset.read()
df = table.to_pandas()
I prefer this way of reading Parquet from S3 because it encourages the use of Parquet partitions through the filters parameter, but note there is a bug affecting this approach: https://issues.apache.org/jira/browse/ARROW-2038.
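For instance, with a hypothetical dataset partitioned by a date column, filters lets PyArrow skip whole partitions before reading any data; a sketch where every name is made up:

import pyarrow.parquet as pq
import s3fs

# Hypothetical layout: s3://my-bucket/events/date=2020-01-01/part-0.parquet, ...
dataset = pq.ParquetDataset(
    's3://my-bucket/events',
    filesystem=s3fs.S3FileSystem(),
    filters=[('date', '=', '2020-01-01')]  # only matching partitions are read
)
df = dataset.read().to_pandas()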
Upvotes: 4
Reputation: 1519
Found a way to simply read a parquet file into a dataframe with the boto3 package.
import boto3
import io
import pandas as pd

# Download the parquet file into an in-memory buffer
buffer = io.BytesIO()
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket-name', 'path/to/parquet/file')
obj.download_fileobj(buffer)

# Rewind the buffer, then parse it as parquet
buffer.seek(0)
df = pd.read_parquet(buffer)
print(df.head())
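An equivalent variant, assuming the same bucket and key, uses the lower-level client API and parses the response body directly:

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket='my-bucket-name', Key='path/to/parquet/file')
df = pd.read_parquet(io.BytesIO(response['Body'].read()))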
Upvotes: 5