Reputation: 845
I am reading parquet files/objects from AWS S3 using the boto3 SDK. The parquet object can have many fields (columns) that I don't need to read. Assume the parquet object has 10 fields:
A B C D E F G H I J
Is there a way to read just columns A, E and H? I am currently reading the parquet object using an s3client as below.
obj = s3client.get_object(Bucket=bucket, Key=key)
pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)
Thank you
Upvotes: 3
Views: 1450
Reputation: 28684
For some reason, I never noticed this. Pandas and its backend parquet engines can read from s3 directly using fsspec/s3fs, and only fetch the bytes you need for the columns you specify:
df = pd.read_parquet(f"s3://{bucket}/{key}", columns=["A", "E", "H"])
You may need to pass extra storage_options={...} if you need to give credentials for S3.
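To make the credentials part concrete, here is a minimal sketch of how the call could be assembled. The helper names (s3_parquet_request, read_columns) and the example bucket/key are hypothetical, and it assumes s3fs plus a parquet engine such as pyarrow are installed; "key" and "secret" are the option names s3fs uses for an access key pair.

```python
def s3_parquet_request(bucket, key, columns, access_key=None, secret_key=None):
    """Build keyword arguments for pd.read_parquet (hypothetical helper)."""
    kwargs = {
        "path": f"s3://{bucket}/{key}",
        "columns": list(columns),  # only these columns are requested
    }
    # Assumption: explicit credentials are passed through to s3fs.
    # Without them, s3fs falls back to the usual AWS credential chain.
    if access_key and secret_key:
        kwargs["storage_options"] = {"key": access_key, "secret": secret_key}
    return kwargs


def read_columns(bucket, key, columns, access_key=None, secret_key=None):
    """Read only the selected columns of a parquet object on S3."""
    import pandas as pd  # needs pandas, s3fs and a parquet engine

    return pd.read_parquet(
        **s3_parquet_request(bucket, key, columns, access_key, secret_key)
    )
```

Usage would look like read_columns("my-bucket", "data/file.parquet", ["A", "E", "H"]). Note that passing columns=["A", "E", "H"] to the original io.BytesIO approach also limits what gets parsed, but the whole object is still downloaded first.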
Upvotes: 1