Reputation: 19328
How can you apply predicate pushdown filters based on the Parquet min/max values when reading a Dask DataFrame?
Suppose you have a data lake with 4 Parquet files, each with `nickname` and `age` columns. This table shows the Parquet filename, the age min value, and the age max value for each file.
| File | age min | age max |
|---------------|---------|---------|
| pets1.parquet | 1 | 9 |
| pets2.parquet | 3 | 9 |
| pets3.parquet | 2 | 4 |
| pets4.parquet | 7 | 12 |
Suppose you'd like to perform an analysis on all the rows of data where the age is greater than 10. You know that you can skip `pets1.parquet`, `pets2.parquet`, and `pets3.parquet` when performing this analysis, because the max age in those files is below the filter threshold of 10. Skipping files can make some analyses run a lot faster. (There's a sketch below of how to inspect these min/max values directly.)
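For context, here is one way to see where those min/max values come from: a minimal sketch that reads each file's footer metadata with pyarrow. The `pets/` directory and the filenames are taken from the table above; adjust the path to your own data lake.

```python
import pyarrow.parquet as pq

# Hypothetical layout: the four files from the table live under a local pets/ directory.
for name in ["pets1.parquet", "pets2.parquet", "pets3.parquet", "pets4.parquet"]:
    meta = pq.ParquetFile(f"pets/{name}").metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            if chunk.path_in_schema == "age" and chunk.statistics is not None:
                # These are the footer statistics that predicate pushdown relies on
                print(name, rg, chunk.statistics.min, chunk.statistics.max)
```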
Here's the `read_parquet` method signature in the API docs: `dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', gather_statistics=None, split_row_groups=None, chunksize=None, **kwargs)`.
Will the `filters` parameter help? Can you please provide a code snippet?
Upvotes: 4
Views: 907
Reputation: 28684
Yes, the `filters` argument is exactly what you want. Use something like the following:
dask.dataframe.read_parquet(path, filters=[('age', ">", 10)], ...)
This will skip row-groups/files where none of the rows can satisfy the condition.
Note that to get this to work, you may need either `gather_statistics=True` or a global `_metadata` file (whose existence will depend on how the data were written).
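Putting that together, here is a hedged sketch of the full read. The `pets/` path, the column names, and the threshold come from the question; `gather_statistics=True` corresponds to the keyword in the signature quoted above and forces Dask to read per-file statistics when no `_metadata` file exists.

```python
import dask.dataframe as dd

ddf = dd.read_parquet(
    "pets/*.parquet",            # hypothetical path to the four files
    columns=["nickname", "age"],
    filters=[("age", ">", 10)],  # row-groups whose max age is <= 10 are skipped entirely
    gather_statistics=True,      # read per-file statistics even without a _metadata file
)

print(ddf.compute())             # only rows from pets4.parquet can satisfy age > 10
```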
Upvotes: 2