Reputation: 606
I have a partitioned parquet dataset that I am trying to read into a pandas DataFrame. The full dataset doesn't fit into memory, so I need to select only some partitions (the partition columns are Year, Month, and Date). I have the following:
pd.read_parquet(
    path_to_dataset,
    filters=[("Date", ">=", "20200715"), ("Date", "<=", "2020804")]
)
When I run this I get a memory error and the Python program crashes. But when I run the following it works without issue, even though in theory it should return exactly the same amount of data (my dataset stops on the 4th):
pd.read_parquet(
    path_to_dataset,
    filters=[("Date", ">=", "20200715")]
)
It seems the second filter, ("Date", "<=", "2020804"), is taking precedence over the first rather than being combined with it as a compound expression. In my current use case I can simply drop the second filter, but I have other cases where the data falls in the middle of the total range, and without the second filter I would again end up reading in too much.
I tried each of the following, with no luck:
(("Date", ">=", "20200715") & ("Date", "<=", "2020804"))
("Date", ">=", "20200715", "Date", "<=", "2020804")
Is there a way to handle compound expressions on the same partition / column?
Reference documentation here: https://arrow.apache.org/docs/python/dataset.html
Upvotes: 1
Views: 3481
Reputation: 3053
"2020804" does not seem to be a valid date, you are missing a zero.
Upvotes: 5