Reputation: 3654
I have a project where I pass the following load_args to read_parquet:
filters = {'filters': [('itemId', '=', '9403cfde-7fe5-4c9c-916c-41ff0b595c5c')]}
According to the documentation, a List[Tuple] like this should be accepted and I should get all partitions which match the predicate (or equivalently, filter out those that do not).
However, it gives me the following error:
/home/user/project/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:1275 in apply_conjunction

  1264         for part, stats in zip(parts, statistics):
  1265             if "filter" in stats and stats["filter"]:
  1266                 continue  # Filtered by engine
  1267             try:
  1268                 c = toolz.groupby("name", stats["columns"])[column][0]
  1269                 min = c["min"]
  1270                 max = c["max"]
  1271             except KeyError:
  1272                 out_parts.append(part)
  1273                 out_statistics.append(stats)
  1274             else:
❱ 1275                 if (
  1276                     operator in ("==", "=")
  1277                     and min <= value <= max
  1278                     or operator == "!="

TypeError: '<=' not supported between instances of 'NoneType' and 'str'
It seems that read_parquet tries to compute min and max values for my str value that I wish to filter on, but I'm not sure that makes sense in this case. Even so, str values should be comparable (though it might not make a huge amount of sense in this case, seeing how the itemId is a random UUID).
Still, I expected this to work. What am I doing wrong?
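The failing comparison can be reproduced without dask. The following is only a minimal sketch: may_contain is a hypothetical, simplified stand-in for the min <= value <= max check shown in the traceback, not dask's actual code, and the statistics values are made up. It suggests that string bounds compare fine, and that the error appears only when the statistics are None.

```python
# Hypothetical, simplified stand-in for the min/max check in the traceback.
# This is NOT dask's implementation; the bounds below are invented.
def may_contain(stats_min, stats_max, value):
    """Return True if a partition spanning [stats_min, stats_max] could hold value."""
    return stats_min <= value <= stats_max


target = "9403cfde-7fe5-4c9c-916c-41ff0b595c5c"

# Strings compare lexicographically, so string statistics are usable.
keep = may_contain("0000", "ffff", target)  # target lies inside the range
skip = may_contain("a000", "ffff", target)  # target sorts before "a000"

# Missing statistics (None) reproduce the error from the question.
try:
    may_contain(None, None, target)
except TypeError as exc:
    error_message = str(exc)

print(keep, skip)
print(error_message)
```

So the str comparison itself is not the problem; the min and max values that reach the comparison are None.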
Upvotes: 1
Views: 402
Reputation: 390
As discovered by aywandji in the aforementioned GitHub issue, the problem comes from the way dask accesses the min/max metadata.
The statistics are looked up by integer position (the i-th column), but the position of a given column name can change from one file to another in the same directory (i.e. the filtered column is not at the same position in every file).
A patch is in progress, and we hope it will land in the next dask release!
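To illustrate the positional lookup problem described above, here is a minimal sketch in plain Python. The metadata layout, the helper names stats_by_position and stats_by_name, and the price column without statistics are all assumptions made for illustration; they do not mirror dask's or pyarrow's real structures.

```python
# Hypothetical per-file column statistics; the layout is invented for illustration.
file_a = [
    {"name": "itemId", "min": "0a", "max": "9f"},
    {"name": "price", "min": None, "max": None},  # assumed: no statistics stored
]
file_b = [  # same columns as file_a, but in a different order
    {"name": "price", "min": None, "max": None},
    {"name": "itemId", "min": "1b", "max": "8e"},
]


def stats_by_position(columns, index):
    """Look up min/max by column position (fragile across files)."""
    return columns[index]["min"], columns[index]["max"]


def stats_by_name(columns, name):
    """Look up min/max by column name (independent of column order)."""
    for col in columns:
        if col["name"] == name:
            return col["min"], col["max"]
    raise KeyError(name)


index_in_a = 0  # position of itemId in file_a, reused for every file

wrong = stats_by_position(file_b, index_in_a)  # picks up the price column
right = stats_by_name(file_b, "itemId")        # picks up itemId regardless of order

print(wrong)
print(right)
```

Under these assumptions, reusing the index from the first file returns the None statistics of an unrelated column, which is the kind of value that would then fail in the min <= value <= max comparison.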
From @filpa:
It is fixed starting with the dask=2023.1.1 release, which was released on 2023-01-28.
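One way to check whether an environment already has that release is sketched below. The helpers parse_release and has_fix are hypothetical names, and the parsing assumes the version string starts with three numeric dotted parts; only importlib.metadata from the standard library is used.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (2023, 1, 1)  # the release named in the answer above


def parse_release(text):
    """Turn '2023.1.1' into (2023, 1, 1); assumes three leading numeric parts."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(installed):
    """Return True if the given version string is at or after FIXED_IN."""
    return parse_release(installed) >= FIXED_IN


try:
    print("dask has the fix:", has_fix(version("dask")))
except PackageNotFoundError:
    print("dask is not installed in this environment")
except ValueError as exc:
    print(exc)
```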
Upvotes: 2
Reputation: 339
The problem probably arises when min and max haven't been redefined before, so they still refer to the built-in functions that compute the minimum and maximum of two numbers, which obviously can't be compared with a string. Try using a different name for these variables (as a rule of thumb, avoid overly generic variable names that may already be defined in the standard library).
Upvotes: 0