denren

Reputation: 1

Since Dask 2.2.0, the read_parquet filters parameter no longer seems to work with the pyarrow engine

When I upgraded Dask from 2.1.0 to 2.2.0 (or 2.3.0), the following code changed its behaviour and stopped filtering parquet files as it did before. This only happens with the pyarrow engine (the fastparquet engine still filters correctly).

I tried pyarrow 0.13.1, 0.14.0 and 0.14.1 without success on Dask 2.2.0 and 2.3.0.

My previous working setup was Dask 2.1.0 with pyarrow 0.14.1.

This code was working with the pyarrow engine:

import dask.dataframe as dd
dd.read_parquet(directory, engine='pyarrow', filters=[(('DatePart', '>=', '2018-01-14'))])

Note that the equivalent code for the fastparquet engine needs one less level of nesting; this still works with fastparquet:

import dask.dataframe as dd
dd.read_parquet(directory, engine='fastparquet', filters=[('DatePart', '>=', '2018-01-14')])

My parquet storage is partitioned by 'DatePart', with _metadata files present.

Now the resulting dataframe is no longer filtered with the pyarrow engine, and no error message is raised.
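For context on the nesting levels mentioned above: these filters are expressed in disjunctive normal form (DNF), where the outer list ORs clauses together and each inner list ANDs (column, op, value) predicates. Below is a minimal pure-Python sketch of how such a DNF filter evaluates against rows; the `matches` helper and the sample rows are illustrative only, not part of dask's or pyarrow's API:

```python
import operator

# Map the filter's comparison strings to Python operators.
OPS = {'==': operator.eq, '!=': operator.ne, '<': operator.lt,
       '<=': operator.le, '>': operator.gt, '>=': operator.ge}

def matches(row, dnf):
    """Return True if `row` (a dict) satisfies the DNF filter:
    outer list = OR of clauses, inner list = AND of predicates."""
    return any(all(OPS[op](row[col], val) for col, op, val in clause)
               for clause in dnf)

rows = [{'DatePart': '2018-01-10'}, {'DatePart': '2018-01-20'}]
dnf = [[('DatePart', '>=', '2018-01-14')]]  # one clause, one predicate
kept = [r for r in rows if matches(r, dnf)]
# kept contains only the 2018-01-20 row
```

ISO-formatted date strings compare correctly lexicographically, which is why string comparison suffices in this sketch.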

Upvotes: 0

Views: 405

Answers (1)

MRocklin

Reputation: 57301

It sounds like you are trying to report a bug. I recommend reporting bugs at https://github.com/dask/dask/issues/new

See https://docs.dask.org/en/latest/support.html#asking-for-help for more information on where the Dask developers prefer to see questions.

Upvotes: 0
