Gowthaman

Reputation: 59

How to selectively read Parquet files from AWS S3 as a Dask Data Frame?

I want to read a selected list of Parquet files from AWS S3. I know how to read all files in a directory using a *.parquet wildcard, or a single file by specifying its key. However, I would like to read only a specific list of files based on some prior user input.

Is this possible?

The following code is from the Dask API docs but does not address my requirement:

import dask.dataframe as dd

df = dd.read_parquet('s3://bucket/path/to/data-*.parquet')
# or
df = dd.read_parquet('s3://bucket/path/to/file.parquet')

Is there a way to pass in a list of target files in the read_parquet parameters instead?

Upvotes: 0

Views: 1286

Answers (1)

EngineJanwaar

Reputation: 450

Using Boto3, list all object keys, filter them down to the objects you require, collect those into a list, and then read each one into a DataFrame in a loop.
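A minimal sketch of that approach, assuming a hypothetical bucket, prefix, and user selection (all names below are placeholders, not from your question):

import boto3
import dask.dataframe as dd

bucket = "my-bucket"                 # hypothetical bucket
prefix = "path/to/"                  # hypothetical prefix
wanted = {"data-001.parquet", "data-007.parquet"}  # example user selection

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect full S3 paths for only the objects the user asked for
selected = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].split("/")[-1] in wanted:
            selected.append(f"s3://{bucket}/{obj['Key']}")

# Read each selected file and combine into one Dask DataFrame
dfs = [dd.read_parquet(path) for path in selected]
df = dd.concat(dfs)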

Using s3fs you can list objects much like you would on Linux, store the object names in a list, and then pass them one by one to the DataFrame in a loop.
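A similar sketch with s3fs, again with a placeholder bucket/prefix and an example filter:

import s3fs
import dask.dataframe as dd

fs = s3fs.S3FileSystem()  # picks up your default AWS credentials

# List the objects under the prefix, much like `ls` on Linux
all_keys = fs.ls("my-bucket/path/to")   # hypothetical bucket/prefix

# Keep only the files the user selected (example filter)
selected = ["s3://" + key for key in all_keys if key.endswith(".parquet")]

# Pass each path to read_parquet in a loop, then combine
dfs = [dd.read_parquet(path) for path in selected]
df = dd.concat(dfs)

Note that dd.read_parquet also accepts a list of paths, so the selected list can be passed in a single call instead of a loop.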

More on getting specific objects with Boto3: Boto3: grabbing only selected objects from the S3 resource

Source for s3fs: https://medium.com/swlh/using-s3-just-like-a-local-file-system-in-python-497737783f11

Upvotes: 1
