Reputation: 19328
How can you apply predicate pushdown filters based on the Parquet min/max values when reading a Dask DataFrame?
Suppose you have a data lake with 4 Parquet files, each with `nickname` and `age` columns. This table shows the Parquet filename, the age min value, and the age max value for each file.
| File | age min | age max |
|---------------|---------|---------|
| pets1.parquet | 1 | 9 |
| pets2.parquet | 3 | 9 |
| pets3.parquet | 2 | 4 |
| pets4.parquet | 7 | 12 |
Suppose you'd like to perform an analysis on all the rows of data where the age is greater than 10. You know that you can skip `pets1.parquet`, `pets2.parquet`, and `pets3.parquet` when performing this analysis, because the max age in those files is below the filter threshold of 10. Skipping files can make some analyses run a lot faster. (There's a sketch below of how to inspect these min/max values directly.)
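For context, here is one way to see where those min/max values come from: a minimal sketch that reads each file's footer metadata with pyarrow. The `pets/` directory and the filenames are taken from the table above; adjust the path to your own data lake.

```python
import pyarrow.parquet as pq

# Hypothetical layout: the four files from the table live under a local pets/ directory.
for name in ["pets1.parquet", "pets2.parquet", "pets3.parquet", "pets4.parquet"]:
    meta = pq.ParquetFile(f"pets/{name}").metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            if chunk.path_in_schema == "age" and chunk.statistics is not None:
                # These are the footer statistics that predicate pushdown relies on
                print(name, rg, chunk.statistics.min, chunk.statistics.max)
```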
Here's the `read_parquet` method signature in the API docs: `dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', gather_statistics=None, split_row_groups=None, chunksize=None, **kwargs)`.
Will the `filters` parameter help? Can you please provide a code snippet?
Upvotes: 4
Views: 907
Reputation: 28684
Yes, the `filters` argument is exactly what you want. Use something like the following:
dask.dataframe.read_parquet(path, filters=[('age', ">", 10)], ...)
This will skip row-groups/files where none of the rows can satisfy the condition.
Note that to get this to work, you may need either `gather_statistics=True` or a global `_metadata` file (whose existence will depend on how the data were written).
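Putting that together, here is a hedged sketch of the full read. The `pets/` path, the column names, and the threshold come from the question; `gather_statistics=True` corresponds to the keyword in the signature quoted above and forces Dask to read per-file statistics when no `_metadata` file exists.

```python
import dask.dataframe as dd

ddf = dd.read_parquet(
    "pets/*.parquet",            # hypothetical path to the four files
    columns=["nickname", "age"],
    filters=[("age", ">", 10)],  # row-groups whose max age is <= 10 are skipped entirely
    gather_statistics=True,      # read per-file statistics even without a _metadata file
)

print(ddf.compute())             # only rows from pets4.parquet can satisfy age > 10
```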
Upvotes: 2