GregH
GregH

Reputation: 12868

Retrieving the most frequently occurring value in PIG

If I have the following data set:

c1   c2
---  ---
1    5
1    5
1    6
2    9
2    9
3    1
3    2

I want to return the most frequently occurring value in the second column (c2) for each value in the first column (c1). So I would want the returning data set to look like the following since for c1=1 the value "5" occurs twice and the value "6" only occurs once and for c1=2, the value of "9" occurs twice and no other value occurs:

1    5
2    9
3    1

The case I am having problems with is where there is an equal number of occurrences (in this case where c1=3.) In the case (c1=3) where there is an equal number of occurring values in c2 then I just want the first occurrence returned.

Any ideas would be helpful.

Upvotes: 2

Views: 1334

Answers (2)

s10z
s10z

Reputation: 1130

For everyone like me new to the language and trying to understand what is going on in the code above, my commented version (added describe and dump outputs):

Data Samle ('sample_data/test_limiting.pig'):

1|5
1|5
1|6
2|9
2|9
3|1
3|2

And the pig script itself:

A = LOAD 'sample_data/test_limiting.pig' USING PigStorage('|') AS (c1:int, c2:int);
----
-- A: {c1: int,c2: int}
----
-- (1,5)
-- (1,5)
-- (1,6)
-- (2,9)
-- (2,9)
-- (3,1)
-- (3,2)


B = GROUP A BY (c1, c2);
----
-- B: {group: (c1: int,c2: int),A: {(c1: int,c2: int)}}
----
-- ((1,5),{(1,5),(1,5)})
-- ((1,6),{(1,6)})
-- ((2,9),{(2,9),(2,9)})
-- ((3,1),{(3,1)})
-- ((3,2),{(3,2)})


C = FOREACH B GENERATE group, COUNT(A) as num;
----
-- C: {group: (c1: int,c2: int),num: long}
----
-- ((1,5),2)
-- ((1,6),1)
-- ((2,9),2)
-- ((3,1),1)
-- ((3,2),1)


D = GROUP C BY group.c1;
----
-- D: {group: int,C: {(group: (c1: int,c2: int),num: long)}}
----
-- (1,{((1,5),2),((1,6),1)})
-- (2,{((2,9),2)})
-- (3,{((3,1),1),((3,2),1)})


E = FOREACH D {
    SA = ORDER C BY num DESC;
    SB = LIMIT SA 1;
    GENERATE FLATTEN(SB.group);
}
----
-- E: {null::group: (c1: int,c2: int)}
----
-- ((1,5))
-- ((2,9))
-- ((3,1))

Btw, I needed to write the reference 'group' in lowercase as it bombed with the uppercase variant. Maybe due to different versions, dunno.

Upvotes: 1

frail
frail

Reputation: 4118

Assuming you have your c1 and c2 on A :

B = GROUP A BY (c1, c2)
C = FOREACH B GENERATE GROUP, COUNT(A) as num;

D = GROUP C BY GROUP.c1
E = FOREACH D {
    SA = ORDER C BY num DESC;
    SB = LIMIT SA 1;
    GENERATE FLATTEN(SB.group);
}

should solve your problem. (I wrote in notepad though, you should check if any flatten needed via describe/illustrate)

Upvotes: 2

Related Questions