jstrong

Reputation: 1607

Can pyarrow filter parquet struct and list columns?

Take the following table stored via pyarrow into Apache Parquet:

   id  regions
0  A   ['us', 'uk']
1  B   ['uk', 'mx']

I'd like to filter the regions column via parquet when loading data. Something like this:

import pyarrow.dataset as ds
dataset = ds.dataset("./example.parquet", format="parquet")
dataset.to_table(filter=ds.scalar('us').isin(ds.field('regions')))

The expectation is that I would get back the first row, but not the second row.

This, however, does not work. The documentation does not have any useful information on how to do this kind of operation. Is there any way to filter on more complex column types?

Upvotes: 1

Views: 1095

Answers (1)

0x26res

Reputation: 13902

As far as I can tell from the documentation, you can't do that with a dataset filter.

The supported operations are <, <=, ==, >=, > as well as isin.

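I think what you want is a "contains" test (does the list in each row contain a given value?), which isn't supported. isin goes the other way: it checks whether a column's value is one of a given set of values.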

You can implement it yourself in Arrow after loading the table, but it's a bit of work:

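import typing

import pyarrow as pa
from pyarrow import compute


def filter_list_column(table: pa.Table, column: str, value: typing.Any) -> pa.Table:
    # Flatten the list column and record which row each flattened value came from.
    flat_list = compute.list_flatten(table[column])
    flat_list_indices = compute.list_parent_indices(table[column])

    # Keep the parent row index of every flattened value equal to `value`.
    equal_mask = compute.equal(flat_list, value)
    equal_table_indices = compute.filter(flat_list_indices, equal_mask)
    # Deduplicate so a row containing `value` more than once is returned only once.
    return compute.take(table, compute.unique(equal_table_indices))


# The table from the question.
table = pa.table({"id": ["A", "B"], "regions": [["us", "uk"], ["uk", "mx"]]})
filter_list_column(table, "regions", "us")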

Which gives you:

   id  regions
0  A   ['us' 'uk']

Upvotes: 3
