Reputation: 1607
Take the following table, stored via pyarrow as Apache Parquet:
|   | id | regions      |
|---|----|--------------|
| 0 | A  | ['us', 'uk'] |
| 1 | B  | ['uk', 'mx'] |
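For reference, the file could be produced with something like this (the question doesn't show the exact write code, so this is just a sketch):
import pyarrow as pa
import pyarrow.parquet as pq

# Build the example table: a string "id" column and a "regions" column
# of string lists, then write it to the path used below.
table = pa.table({
    "id": ["A", "B"],
    "regions": [["us", "uk"], ["uk", "mx"]],
})
pq.write_table(table, "./example.parquet")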
I'd like to filter the regions column via parquet when loading data. Something like this:
import pyarrow.dataset as ds
dataset = ds.dataset("./example.parquet", format="parquet")
dataset.to_table(filter=ds.scalar('us').isin(ds.field('regions')))
The expectation is that I would get back the first row, but not the second row.
This, however, does not work. The documentation does not have any useful information on how to do this kind of operation. Is there any way of performing filters on more complex column types?
Upvotes: 1
Views: 1095
Reputation: 13902
As far as I can tell from the documentation you can't do that.
The supported operations are `<`, `<=`, `==`, `>=`, and `>`, as well as `isin`. I think what you want is `contains`, which isn't supported.
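For comparison, those operators do work on plain (non-list) columns; a filter like this on the question's file should run (a minimal sketch, not tested against your data):
import pyarrow.dataset as ds

# Equality on a scalar column is supported, e.g. selecting the row
# whose "id" equals "A".
dataset = ds.dataset("./example.parquet", format="parquet")
dataset.to_table(filter=ds.field("id") == "A")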
You can implement it yourself in arrow, but it's a bit of work:
import typing

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import compute

def filter_list_column(table: pa.Table, column: str, value: typing.Any) -> pa.Table:
    # Flatten the list column into one long array and record which
    # row each flattened element came from.
    flat_list = compute.list_flatten(table[column])
    flat_list_indices = compute.list_parent_indices(table[column])
    # Mark the elements equal to the value, map the matches back to
    # their parent row indices, and take those rows from the table.
    equal_mask = compute.equal(flat_list, value)
    equal_table_indices = compute.filter(flat_list_indices, equal_mask)
    return compute.take(table, equal_table_indices)

# Load the example file from the question and filter it.
table = ds.dataset("./example.parquet", format="parquet").to_table()
filter_list_column(table, "regions", "us")
Which gives you:
|   | id | regions     |
|---|----|-------------|
| 0 | A  | ['us' 'uk'] |
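If you need rows whose list contains any of several values, the same flatten-and-take idea should work with `compute.is_in` in place of `compute.equal`; here is a hypothetical `filter_list_column_any` sketch (`compute.unique` deduplicates the row indices so a row matching more than one value comes back only once):
import typing

import pyarrow as pa
from pyarrow import compute

def filter_list_column_any(table: pa.Table, column: str,
                           values: typing.Sequence) -> pa.Table:
    # Keep rows whose list column contains at least one of `values`.
    flat_list = compute.list_flatten(table[column])
    flat_list_indices = compute.list_parent_indices(table[column])
    member_mask = compute.is_in(flat_list, value_set=pa.array(values))
    matching_indices = compute.filter(flat_list_indices, member_mask)
    # A row could match several values; unique() keeps it only once.
    return compute.take(table, compute.unique(matching_indices))

filter_list_column_any(table, "regions", ["us", "mx"])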
Upvotes: 3