Reputation: 1
When I upgraded Dask from 2.1.0 to 2.2.0 (or 2.3.0), the following code changed its behaviour and stopped filtering parquet files as it did before. This only happens with the pyarrow engine (the fastparquet engine still filters correctly).
I tried pyarrow 0.13.1, 0.14.0 and 0.14.1 without success on Dask 2.2.0 and 2.3.0.
My previous working setup was Dask 2.1.0 with pyarrow 0.14.1.
This code was working with the pyarrow engine:
import dask.dataframe as dd
dd.read_parquet(directory, engine='pyarrow', filters=[(('DatePart', '>=', '2018-01-14'))])
Note that the equivalent code for the fastparquet engine needs one less level of list nesting, and it still works with fastparquet:
import dask.dataframe as dd
dd.read_parquet(directory, engine='fastparquet', filters=[('DatePart', '>=', '2018-01-14')])
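To make the nesting difference concrete: the pyarrow engine documents its filters in disjunctive normal form (an OR-list of AND-lists of `(column, op, value)` tuples), while fastparquet takes a flat AND-list of tuples. The sketch below is not Dask's actual implementation, just a minimal illustration of how a DNF filter list is evaluated against rows:

```python
import operator

# Map the string operators used in read_parquet filters to functions.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq, "!=": operator.ne}

def matches(row, dnf_filters):
    """True if the row dict satisfies the DNF filter list:
    outer list = OR, inner list = AND, each tuple = (column, op, value)."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in dnf_filters
    )

rows = [{"DatePart": "2018-01-10"},
        {"DatePart": "2018-01-14"},
        {"DatePart": "2018-01-20"}]

# pyarrow-style DNF nesting: [[(...)]]
filters = [[("DatePart", ">=", "2018-01-14")]]
kept = [r for r in rows if matches(r, filters)]
# keeps the rows with DatePart >= '2018-01-14'
```

Note also that in Python `(('DatePart', '>=', '2018-01-14'))` is just a tuple (the extra parentheses are redundant), so the pyarrow call above actually passes a flat list of tuples, not the nested form.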
My parquet storage is partitioned by 'DatePart' and has existing _metadata files.
After upgrading, the resulting dataframe is no longer filtered with the pyarrow engine, and no error message is raised.
Upvotes: 0
Views: 405
Reputation: 57301
It sounds like you are trying to report a bug. I recommend reporting bugs at https://github.com/dask/dask/issues/new
See https://docs.dask.org/en/latest/support.html#asking-for-help for more information on where the Dask developers prefer to see questions.
Upvotes: 0