1EnemyLeft
1EnemyLeft

Reputation: 959

Aggregate data grouping by two columns in Pig

I have these data that I need to group by two columns and then sum up two other fields. Suppose the name for these four columns are:OS,device,view,click. I basically want to know the count for each OS and device, how many views they have and how many clicks it have.

(2,3346,1,)
(3,3953,1,1)
(25,4840,1,1)
(2,94840,1,1)
(14,0526,1,1)
(37,4864,1,)
(2,7353,1,)

This is what I have so far

A is data: OS,device,view,click

B = GROUP A BY (OS,device);

Result = FOREACH  B {
    GENERATE group AS OS,device, SUM(view) AS visits, SUM(click) AS clicks;};
dump Result; 

This one won't work, error message is: Projected field [OS] does not exist in schema: group:tuple(OS:int,device:long),B:bag{:tuple(OS:int,device:long,view:int,click:int)}.

Upvotes: 0

Views: 1132

Answers (2)

Durga Viswanath Gadiraju
Durga Viswanath Gadiraju

Reputation: 3966

Here is the code which is tested, you are missing FLATTEN:

A = LOAD '/user/root/pig_data' using PigStorage(',') AS (OS, device, view, click);
B = GROUP A BY (OS, device);
RESULT = FOREACH B GENERATE FLATTEN(group) AS (OS, device), SUM(A.view) as views, SUM(A.click) as clicks;
dump RESULT;

Upvotes: 1

Patrick the Cat
Patrick the Cat

Reputation: 2168

I think you meant B in your example instead of J2 or J3, which may be in your actual code. Try:

B = GROUP A BY (OS, device);

Result = FOREACH B GENERATE
    group.OS AS OS:int,
    group.device AS device:long,
    SUM(B.view) AS visits:int,
    SUM(B.click) AS clicks:int;

dump Result; 

Upvotes: 0

Related Questions