jscriptor

Reputation: 845

Reading selected columns from an AWS S3 parquet file

I am reading parquet files/objects from AWS S3 using the boto3 SDK. The parquet object can have many fields (columns) that I don't need to read. Assume the parquet object has 10 fields:

A B C D E F G H I J

Is there a way to read just columns A, E, and H? I am currently reading the parquet object using an s3client as below.

obj = s3client.get_object(Bucket=bucket, Key=key)
pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)

Thank you

Upvotes: 3

Views: 1450

Answers (1)

mdurant

Reputation: 28684

For some reason, I never noticed this. Pandas and its backend parquet engines can read from S3 directly using fsspec/s3fs, and fetch only the bytes needed for the columns you specify:

df = pd.read_parquet(f"s3://{bucket}/{key}", columns=["A", "E", "H"])

You may need to pass storage_options={...} to read_parquet if you need to give credentials for S3.
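As a rough sketch of how the credentials case might look: the helper names below (s3_read_kwargs, read_selected_columns) are made up for illustration, and the "key"/"secret" entries are the options s3fs accepts for explicit credentials. This assumes pandas 1.2 or later with s3fs and a parquet engine (pyarrow or fastparquet) installed; adjust to however your credentials are actually provided.

```python
def s3_read_kwargs(bucket, key, columns, access_key=None, secret_key=None):
    """Build keyword arguments for pd.read_parquet (hypothetical helper)."""
    kwargs = {
        "path": f"s3://{bucket}/{key}",
        "columns": list(columns),  # only these columns are requested
    }
    if access_key and secret_key:
        # Passed through fsspec to s3fs; omit to use the default
        # credential lookup (environment variables, ~/.aws, IAM role).
        kwargs["storage_options"] = {"key": access_key, "secret": secret_key}
    return kwargs


def read_selected_columns(bucket, key, columns, **creds):
    """Read only the given columns of an S3 parquet object (hypothetical helper)."""
    # Imported here so the kwargs helper can be used without pandas installed.
    import pandas as pd

    return pd.read_parquet(**s3_read_kwargs(bucket, key, columns, **creds))
```

With placeholder names, a call would look something like read_selected_columns("my-bucket", "path/to/file.parquet", ["A", "E", "H"], access_key="...", secret_key="..."). If the default AWS credential chain already works for boto3, leaving out the credentials should generally work here too.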

Upvotes: 1
