user1436111

Reputation: 2141

Pig - counting members across a group

Say I have a relation Students, with fields grade and teacher. I want to group by both grade and teacher, but retain a count of all the students per grade in each group. Something like:

classes = GROUP Students BY (grade,teacher);
classes = FOREACH classes {
   GENERATE
      (### COUNT OF ALL STUDENTS IN GRADE ###) as grade_size,
      Students as students,
      teacher as teacher;
}

But I can't figure out how to compute that count from inside the FOREACH over the group. It probably needs some kind of filter, but I don't know how to scope the grade of the students outside vs. inside the group.

Upvotes: 0

Views: 128

Answers (1)

alexeipab

Reputation: 3619

There are 2 ways of doing it:

1) Group by grade and teacher, then count, then flatten and group by grade.

classes = GROUP Students BY (grade,teacher);
teachers = FOREACH classes GENERATE FLATTEN(group) as (grade,teacher), COUNT(Students) as perTeacher;
grade = GROUP teachers BY grade;
result = FOREACH grade GENERATE FLATTEN(teachers), SUM(teachers.perTeacher) as perGrade;
describe result;
dump result;
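
If you also need the bag of students in each (grade, teacher) row, as in the question, one possible variant is to carry the bag through the first FOREACH. This is only a sketch: the input path, the name field and the by_grade alias are assumptions, not something from the question.

```pig
-- Hypothetical input path and schema; only grade and teacher come from the question.
Students = LOAD 'students.tsv' AS (name:chararray, grade:int, teacher:chararray);

-- First pass: one row per (grade, teacher), keeping the students bag and its size.
classes = GROUP Students BY (grade, teacher);
teachers = FOREACH classes GENERATE
    FLATTEN(group) AS (grade, teacher),
    Students AS students,
    COUNT(Students) AS perTeacher;

-- Second pass: regroup by grade and attach the grade total to every teacher row.
by_grade = GROUP teachers BY grade;
result = FOREACH by_grade GENERATE
    FLATTEN(teachers) AS (grade, teacher, students, perTeacher),
    SUM(teachers.perTeacher) AS grade_size;
```

Each output row should then hold the grade, the teacher, that teacher's students, the per-teacher count and the per-grade total.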

2) Group by grade, then use the BagGroup UDF from the DataFu library to do an in-memory group by teacher within each grade. This is faster, but it is vulnerable to heap memory errors when a grade's bag is large.
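
A rough sketch of the second option follows. The jar path, the input path and the name field are assumptions. As far as I recall from the DataFu docs, BagGroup takes the bag plus a projection of that bag onto the key fields and returns a bag of (group, bag) tuples, so check the signature against the DataFu version you have.

```pig
-- Assumption: the DataFu jar is reachable at this hypothetical path.
REGISTER datafu.jar;
DEFINE BagGroup datafu.pig.bags.BagGroup();

-- Hypothetical input path and schema; only grade and teacher come from the question.
Students = LOAD 'students.tsv' AS (name:chararray, grade:int, teacher:chararray);

-- One group per grade, so COUNT(Students) is the size of the whole grade.
grades = GROUP Students BY grade;

-- BagGroup regroups the grade's bag by teacher in memory;
-- FLATTEN turns each (teacher, students) tuple into its own output row.
result = FOREACH grades GENERATE
    group AS grade,
    COUNT(Students) AS grade_size,
    FLATTEN(BagGroup(Students, Students.(teacher))) AS (teacher, students);
```

This needs only one GROUP, but the whole bag for a grade has to fit in memory while BagGroup runs.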

Upvotes: 1
