Reputation: 55
I am new to Pig. I have a Pig script that generates tab-separated pairs of elements, one pair per line. For example:
John Paul
Tom Nik
Mark Bill
Tom Nik
Paul John
I need to filter out duplicate combinations. If I use DISTINCT, it removes the duplicate "Tom Nik" entry. The result is:
John Paul
Tom Nik
Mark Bill
Paul John
The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove such permuted duplicates?
Upvotes: 2
Views: 360
Reputation: 5186
I'm not sure how string comparison is implemented in Pig, but it may be worthwhile to try something like:
-- A is your input
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0))) ;
C = DISTINCT B ;
By ordering each pair so that the 'smaller' name always appears first, both John Paul and Paul John end up in the same order, so the DISTINCT eliminates one of them.
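However, this approach depends entirely on how the string comparison is implemented. For example, if it compared only lengths, the John Paul case would not be filtered correctly: both names have four letters, so neither would count as 'smaller' and the two rows would still come out in different orders.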
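One way to reduce that uncertainty is to declare the fields as chararray when loading, so the comparison is made on strings rather than on untyped values. The following is only a sketch under stated assumptions: the input path 'pairs.tsv' and the field names name1, name2, lo and hi are made up for illustration, and it assumes chararray comparison in Pig is lexicographic.

```pig
-- Hypothetical input path and field names; adjust to your own script.
A = LOAD 'pairs.tsv' USING PigStorage('\t') AS (name1:chararray, name2:chararray);

-- Put the lexicographically smaller name first in every pair.
B = FOREACH A GENERATE
        (name1 < name2 ? name1 : name2) AS lo,
        (name1 < name2 ? name2 : name1) AS hi;

-- Permuted pairs are now identical rows, so DISTINCT collapses them.
C = DISTINCT B;
DUMP C;
```

Under those assumptions, the sample input should reduce to three pairs: John/Paul, Nik/Tom and Bill/Mark. Note that the names within a row may be swapped relative to the input (Tom Nik becomes Nik Tom), which is fine as long as the order inside a pair does not matter.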
Upvotes: 1