Reputation: 190
Not sure why Pig Latin automatically eliminates null records, without the programmer intending it, when a FILTER statement is applied to a particular field in a dataset. Any explanation is much appreciated.
Upvotes: 0
Views: 85
Reputation: 826
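Pig omits nulls in general, which makes it a bit painful to work with corrupted data. In a FILTER, a comparison against a null field evaluates to null rather than true, and only records for which the condition is true are passed through.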
As Tom White notes in Hadoop: The Definitive Guide, Pig produces a warning for the invalid field (which becomes null), but does not halt its processing.
The way to deal with such issues is either to replace the missing values with a sentinel code such as 999, or to split the data into good and bad records and take a look at what is going on.
We generally check data quality by counting missing values at various stages of the data aggregation pipeline.
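A minimal sketch of these options in Pig Latin. The input path, the relation names, and the two-field schema are made up for illustration, so adjust them to your data:

```pig
-- Hypothetical input path and schema, for illustration only.
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);

-- Option 1: replace nulls with a sentinel code (999 here) so that
-- later comparisons on temperature no longer evaluate to null.
filled = FOREACH records GENERATE
    year,
    (temperature IS NULL ? 999 : temperature) AS temperature;

-- Option 2: split the data into good and bad records explicitly.
SPLIT records INTO
    good IF temperature IS NOT NULL,
    bad  IF temperature IS NULL;

-- Data quality check: count the records with a missing temperature.
-- COUNT_STAR counts every tuple in the bag, including ones with nulls.
bad_all   = GROUP bad ALL;
bad_count = FOREACH bad_all GENERATE COUNT_STAR(bad);
DUMP bad_count;
```

Note that if there are no bad records at all, `GROUP bad ALL` produces an empty relation, so the final `DUMP` shows nothing rather than a zero.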
Upvotes: 1