user3489477

Reputation: 55

Pig: how to filter distinct couples (pairs)

I am new to Pig. I have a Pig script which generates tab-separated pairs of two elements, one pair per line. For example:

John   Paul
Tom    Nik
Mark   Bill
Tom    Nik
Paul   John

I need to filter out duplicate combinations. If I use DISTINCT, it removes the duplicate "Tom Nik" entry. The result is:

John   Paul
Tom    Nik
Mark   Bill
Paul   John

The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove permuted combinations as well?

Upvotes: 2

Views: 360

Answers (1)

mr2ert

Reputation: 5186

I'm not sure how string comparison is implemented in Pig, but it may be worthwhile to try something like:

-- A is your input
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0))) ; 
C = DISTINCT B ;

By ordering each pair so that the 'smaller' name always appears first, both John Paul and Paul John end up in the same order, so the DISTINCT eliminates one of them.

However, this approach depends entirely on how the string comparison is implemented. For example, if it compares only by length, then the John Paul case will not be filtered correctly, because the two names have the same length and neither counts as the smaller one.
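
For reference, here is a slightly more explicit sketch of the same idea, with a typed schema so that the comparison is unambiguously between chararrays. The file name 'pairs.tsv' and the field names a, b, first and second are placeholders I made up for illustration, not something from your script. It assumes the comparison orders any two different strings consistently, which I would expect from an ordinary lexicographic comparison, but I have not checked how Pig actually does it.

```pig
-- Load the tab-separated pairs with an explicit chararray schema.
-- 'pairs.tsv', a and b are hypothetical names; adapt them to your script.
A = LOAD 'pairs.tsv' USING PigStorage('\t') AS (a:chararray, b:chararray);

-- Put the 'smaller' name first and the 'larger' name second, so that
-- (John, Paul) and (Paul, John) both become the same ordered pair.
B = FOREACH A GENERATE
        (a < b ? a : b) AS first,
        (a < b ? b : a) AS second;

-- Identical ordered pairs now collapse into a single row.
C = DISTINCT B;

DUMP C;
```

Using one conditional per output field avoids building tuples inside the conditional and flattening them afterwards, but it is the same technique. Note that the output no longer preserves the original left/right order of each pair, which should not matter if the pairs are meant to be unordered.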

Upvotes: 1
