Retrieving the most frequently occurring value in PIG

Question

If I have the following data set:

I want to return the most frequently occurring value in the second column (c2) for each value in the first column (c1). The result should look like the following: for c1=1 the value "5" occurs twice and the value "6" occurs only once, and for c1=2 the value "9" occurs twice and no other value occurs:

1    5
2    9
3    1

The case I am having problems with is a tie, where several values in c2 occur equally often (here, c1=3). In that case I just want the first occurrence returned.

Any ideas would be helpful.

frail · Accepted Answer

Assuming your relation A has the fields c1 and c2:

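-- Count how often each (c1, c2) pair occurs.
B = GROUP A BY (c1, c2);
C = FOREACH B GENERATE group, COUNT(A) AS num;

-- For each c1, keep the pair with the highest count.
D = GROUP C BY group.c1;
E = FOREACH D {
    SA = ORDER C BY num DESC;
    SB = LIMIT SA 1;
    GENERATE FLATTEN(SB.group);
}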

This should solve your problem. (I wrote it in Notepad, though, so check with DESCRIBE/ILLUSTRATE whether any further FLATTEN is needed.)
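
The answer leaves two points open: whether a further FLATTEN is needed, and how ties such as c1=3 are resolved. The variant below is only a sketch under assumptions: the file name 'data.tsv' and the int schema are hypothetical, and ties are broken by the smaller c2 value, which is a swapped-in rule and not the "first occurrence" the question asks for (Pig does not track input order unless the data carries an explicit position column).

```pig
-- Hypothetical input: tab-separated rows of (c1, c2).
A = LOAD 'data.tsv' AS (c1:int, c2:int);

-- Count each (c1, c2) pair and flatten the group key back into two columns.
B = GROUP A BY (c1, c2);
C = FOREACH B GENERATE FLATTEN(group) AS (c1, c2), COUNT(A) AS num;

-- Within each c1, sort by count (highest first), then by c2 to make ties deterministic.
-- Assumption: the smaller c2 wins a tie.
D = GROUP C BY c1;
E = FOREACH D {
    SA = ORDER C BY num DESC, c2 ASC;
    SB = LIMIT SA 1;
    GENERATE FLATTEN(SB.(c1, c2));
}

DUMP E;
```

Flattening the group key in C means E ends up with two plain columns, c1 and c2, rather than a nested tuple. If "first occurrence" matters, the input would need a position column to sort on in place of c2.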
