Fabrice Lefloch

Reputation: 417

PyArrow read_table filter null values

I'm fairly new to PyArrow and I'm trying to read a Parquet file while filtering the data as it loads. I have an end_time column, and filtering on a date works fine: I get only the rows that match.

import pyarrow.parquet as pq
from datetime import datetime

last_updated_at = datetime(2021, 3, 5, 21, 0, 23)
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=[('end_time', '>', last_updated_at)])
print(table_ad_sets_ongoing.num_rows)

But end_time is sometimes null, so I tried filtering this way:

import pyarrow.parquet as pq

table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=[('end_time', '=', None)])
print(table_ad_sets_ongoing.num_rows)

But the result is always 0, even though some rows do have a null end_time. After some digging, I suspect this is related to null_selection_behavior, which defaults to 'drop' and therefore skips null values: https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html#pyarrow.compute.filter I guess I should set this parameter to 'emit_null', but I can't find a way to do that.

Any idea?

Thank you

Upvotes: 3

Views: 3725

Answers (1)

Fabrice Lefloch

Reputation: 417

I finally found the answer to my question. It comes from the Arrow GitHub issue tracker (I should have looked there earlier): https://github.com/apache/arrow/issues/9160

To keep only the rows where a field is null, pass a dataset expression as the filter instead of a list of tuples:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# is_valid() is true for non-null values, so negating it keeps only the null rows.
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=~ds.field("end_time").is_valid())
print(table_ad_sets_ongoing.num_rows)

Upvotes: 4
