Reputation: 2977
I'm having trouble understanding how group by group_name works in a foreach loop.
Let's say we already have a variable named grouped_data that was defined as:
grouped_data = group dataset by (emp_id, dept_id);
And then we want to iterate over each record in grouped_data with an aggregated column added in. So the following is written:
with_hours_worked = FOREACH grouped_data
GENERATE group AS grp,
SUM(dataset.worked_hours) AS hours ;
I'm confused as to what is going on in that last line, especially the group AS grp part. Is grp a tuple? Is the line from grouped_data converted back into a group? If so, why?
Upvotes: 5
Views: 813
Reputation: 3570
Whenever you use group by in Pig, a new record consisting of two parts is created for each group: the first part is a tuple containing the values you grouped by, and the second is a bag containing all the original records that belong to that group.
For example, if you have the following data:
user_id, dept_id, blah_1, blah_2
1,41,pig,mapreduce
1,41,spark,apache
2,30,oh,yeah
After grouping by user_id and dept_id, you will have the following:
((1,41),{(1,41,pig,mapreduce),(1,41,spark,apache)})
((2,30),{(2,30,oh,yeah)})
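The first part is what Pig calls group: a tuple containing, in this case, user_id and dept_id. So yes, grp is a tuple. The second part is a bag named after the relation you grouped (dataset in your code), which is why SUM(dataset.worked_hours) works. The group AS grp just renames the first part to grp... Not a great name, but that's what that code is doing!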
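If it helps to see this for yourself, here is a minimal sketch. The input file 'hours.csv' and its three-column schema are assumptions made up for illustration; they are not from your post. DESCRIBE prints the schema of grouped_data, and the two FOREACH statements show two common ways to get the individual key fields back out of the group tuple instead of carrying it around as grp.

```pig
-- Hypothetical input file and schema, for illustration only.
dataset = LOAD 'hours.csv' USING PigStorage(',')
          AS (emp_id:int, dept_id:int, worked_hours:int);

grouped_data = GROUP dataset BY (emp_id, dept_id);

-- Prints the schema: a tuple named "group" plus a bag named "dataset".
DESCRIBE grouped_data;

-- "group" is a tuple, so its fields can be read with dot notation...
by_field = FOREACH grouped_data
           GENERATE group.emp_id  AS emp_id,
                    group.dept_id AS dept_id,
                    SUM(dataset.worked_hours) AS hours;

-- ...or unpacked into top-level fields in one step with FLATTEN.
flattened = FOREACH grouped_data
            GENERATE FLATTEN(group) AS (emp_id, dept_id),
                     SUM(dataset.worked_hours) AS hours;

DUMP flattened;
```

Either form gives you one row per (emp_id, dept_id) pair with plain columns, which is usually easier to work with downstream than a nested grp tuple.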
Upvotes: 3