Reputation: 59
I want to read a selected list of Parquet files from AWS S3. I know how to read all files in a directory with a glob pattern like *.parquet, or a single file by specifying its key. However, I would like to read only a specific list of files based on some prior user input.
Is this possible?
The following code is from the Dask API docs, but it does not address my requirement:
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/data-*.parquet')
(OR)
df = dd.read_parquet('s3://bucket/path/to/file.parquet')
Is there a way to pass a list of target files to read_parquet instead?
Upvotes: 0
Views: 1286
Reputation: 450
Using Boto3, list the object keys under your prefix, filter that list down to the objects you require, and pass the resulting paths to read_parquet (see the sketch below).
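A minimal sketch of the Boto3 approach, assuming a hypothetical bucket named 'bucket', a prefix 'path/to/', and a made-up set of user-selected keys. Rather than looping file by file, the filtered list can be handed to read_parquet in one call, since it accepts a list of paths as well as a single string:

import boto3
import dask.dataframe as dd

s3 = boto3.client('s3')

# List everything under the prefix (single page; paginate if you have >1000 keys)
resp = s3.list_objects_v2(Bucket='bucket', Prefix='path/to/')
keys = [obj['Key'] for obj in resp.get('Contents', []) if obj['Key'].endswith('.parquet')]

# Hypothetical stand-in for the prior user input mentioned in the question
user_selected_keys = {'path/to/data-01.parquet', 'path/to/data-07.parquet'}
selected = [k for k in keys if k in user_selected_keys]

# read_parquet accepts a list of paths
df = dd.read_parquet(['s3://bucket/' + k for k in selected])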
Using s3fs, you can list objects much like you would on a local Linux filesystem, store the object names in a list, and pass them to read_parquet (sketch after this paragraph).
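A comparable sketch with s3fs, again with hypothetical bucket/prefix names and a made-up selection criterion. s3fs exposes glob and ls methods that behave like their shell counterparts, returning keys without the s3:// scheme:

import s3fs
import dask.dataframe as dd

fs = s3fs.S3FileSystem()

# glob works like on a local filesystem; returns keys such as 'bucket/path/to/data-01.parquet'
all_files = fs.glob('bucket/path/to/*.parquet')

# Narrow to the user's selection (hypothetical filter condition)
selected = [f for f in all_files if 'data-0' in f]

df = dd.read_parquet(['s3://' + f for f in selected])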
More on fetching specific objects with Boto3: Boto3: grabbing only selected objects from the S3 resource
Source for s3fs: https://medium.com/swlh/using-s3-just-like-a-local-file-system-in-python-497737783f11
Upvotes: 1