Reputation: 417
I'm fairly new to PyArrow and I'm trying to read a Parquet file while filtering the data as I load it. I have an end_time column, and filtering on a date works fine: I get only the rows that match my date condition.
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
last_updated_at = datetime(2021,3,5,21,0,23)
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters = [('end_time', '>', last_updated_at)])
print(table_ad_sets_ongoing.num_rows)
But the end_time field is sometimes null, so I tried filtering this way:
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters = [('end_time', '=', None)])
print(table_ad_sets_ongoing.num_rows)
But the result is always 0, even though I do have rows with a null value. After some digging, I suspect this has to do with null_selection_behavior, which defaults to 'drop' and therefore skips null values: https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html#pyarrow.compute.filter I guess I should set this parameter to 'emit_null', but I can't find a way to do it.
Any idea?
Thank you
Upvotes: 3
Views: 3725
Reputation: 417
I finally found the answer to my question. It comes from the Arrow GitHub repository (silly of me not to have looked there earlier): https://github.com/apache/arrow/issues/9160
To filter on a null field, we have to do it this way:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from datetime import datetime
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=~ds.field("end_time").is_valid())
print(table_ad_sets_ongoing.num_rows)
Upvotes: 4