pyarrow dataset filtering with multiple conditions

Question

I have a partitioned Parquet dataset that I am trying to read into a pandas DataFrame. The full dataset doesn't fit into memory, so I need to select only some partitions (the partition columns are Year, Month, and Date). I have the following:

pd.read_parquet(
    path_to_dataset,
    filters=[("Date", ">=", "20200715"), ("Date", "<=", "2020804")]
)

When I run this I get a memory error and the Python process crashes. But when I run the following, it works without issue, even though in theory it should return exactly the same data (my dataset ends on August 4th).

pd.read_parquet(
    path_to_dataset,
    filters=[("Date", ">=", "20200715")]
)

It seems the second filter ("Date", "<=", "2020804") is taking precedence over the first rather than being treated as part of a compound expression. In my current use case I can just remove the second filter, but I have other cases where the data I want falls in the middle of the total range, and without an upper bound I would again end up reading in too much.

I tried each of the following with no luck.

(("Date", ">=", "20200715") & ("Date", "<=", "2020804"))
("Date", ">=", "20200715", "Date", "<=", "2020804")

Is there a way to handle compound expressions on the same partition / column?

Reference documentation here: https://arrow.apache.org/docs/python/dataset.html

Samuel · Accepted Answer

"2020804" does not seem to be a valid date; it is missing a zero and should be "20200804".
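
As a minimal sketch of how to guard against this kind of typo, assuming the Date partition values are eight-character YYYYMMDD strings: the helpers check_yyyymmdd and date_range_filters below are hypothetical names invented for illustration, not part of pandas or pyarrow. They validate both bounds before building the filter list. Note that a flat list of tuples passed to filters is already combined with AND, so the original two-tuple form is the right way to express a range on one column.

```python
from datetime import datetime


def check_yyyymmdd(value: str) -> str:
    """Return value unchanged if it is a real date written as YYYYMMDD, else raise.

    Hypothetical helper for illustration only.
    """
    # strptime alone is too lenient: it accepts "2020804" and reads it as
    # 2020-08-04. Formatting the parsed date back and requiring an exact
    # match rejects such short or unpadded inputs.
    try:
        ok = datetime.strptime(value, "%Y%m%d").strftime("%Y%m%d") == value
    except ValueError:
        ok = False
    if not ok:
        raise ValueError(f"not a valid YYYYMMDD date: {value!r}")
    return value


def date_range_filters(start: str, end: str, column: str = "Date"):
    """Build a filters list selecting start <= column <= end (inclusive).

    Hypothetical helper; assumes the column holds YYYYMMDD strings.
    """
    # Tuples in a single flat list are ANDed together.
    return [
        (column, ">=", check_yyyymmdd(start)),
        (column, "<=", check_yyyymmdd(end)),
    ]


filters = date_range_filters("20200715", "20200804")
```

The resulting list can then be passed as pd.read_parquet(path_to_dataset, filters=filters). Calling date_range_filters("20200715", "2020804") raises a ValueError instead of sending the malformed bound to the reader.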
