simplycoding

Reputation: 2977

How does 'group as' work in Pig?

I'm having trouble understanding how group AS group_name works inside a FOREACH statement.

Let's say we already have a variable named grouped_data that was defined as:

grouped_data = group dataset by (emp_id, dept_id);

And then we want to iterate over each record in grouped_data with an aggregated column added in. So the following is written:

with_hours_worked = FOREACH grouped_data 
                    GENERATE group AS grp, 
                             SUM(dataset.worked_hours) AS hours ;

I'm confused as to what is going on in that last line, especially the group AS grp part. Is grp a tuple? Is the line from grouped_data converted back into a group? If so, why?

Upvotes: 5

Views: 813

Answers (1)

Balduz

Reputation: 3570

Whenever you use group by in Pig, a new record consisting of two parts is created for each group: the first part is a tuple containing the values you grouped by, and the second is a bag containing all the original records that belong to that group.

For example, if you have the following data:

user_id, dept_id, blah_1, blah_2
1,41,pig,mapreduce
1,41,spark,apache
2,30,oh,yeah

After grouping by user_id and dept_id, you will have the following:

((1,41),{(1,41,pig,mapreduce),(1,41,spark,apache)})
((2,30),{(2,30,oh,yeah)})

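The first part is what Pig calls group: in this case, the tuple containing user_id and dept_id. So yes, grp is a tuple, and group AS grp just renames that field to grp. Nothing is converted back into a group. Not a great name, but that's what the code is doing! The second part, the bag, is named after the relation you grouped (dataset in your code), which is why SUM(dataset.worked_hours) can add up worked_hours across all the records in each group.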
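
If it helps to see this for yourself, here is a minimal sketch based on the code in the question. The file name dataset.csv and the three-column schema are assumptions made for illustration only; adjust them to your real data. DESCRIBE prints the schema of the grouped relation, and FLATTEN(group) is one way to get emp_id and dept_id back as ordinary columns instead of a single tuple.

```pig
-- Hypothetical input file and schema, assumed for this example only.
dataset = LOAD 'dataset.csv' USING PigStorage(',')
          AS (emp_id:int, dept_id:int, worked_hours:int);

grouped_data = GROUP dataset BY (emp_id, dept_id);

-- Shows two fields: "group" (a tuple of emp_id and dept_id)
-- and "dataset" (a bag holding the original records of that group).
DESCRIBE grouped_data;

-- Same aggregation as in the question, but FLATTEN unpacks the
-- group tuple into two top-level columns instead of renaming it.
with_hours_worked = FOREACH grouped_data
                    GENERATE FLATTEN(group) AS (emp_id, dept_id),
                             SUM(dataset.worked_hours) AS hours;

DUMP with_hours_worked;
```

Whether you keep the tuple (group AS grp) or flatten it is mostly a matter of what the next step expects: the tuple is fine if you only store the result, while flattened columns are easier to join or filter on later.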

Upvotes: 3
