Reputation: 2141
Say I have a relation Students, with fields grade and teacher. I want to group by both grade and teacher, but retain a count of all the students per grade in each group. Something like:
classes = GROUP Students BY (grade,teacher);
classes = FOREACH classes {
GENERATE
(### COUNT OF ALL STUDENTS IN GRADE ###) as grade_size,
Students as students,
teacher as teacher;
}
But I can't figure out how to do the filter from inside the group statement. I need some kind of filter, but I don't know how to scope the grade of the students outside vs. inside the group.
Upvotes: 0
Views: 128
Reputation: 3619
There are 2 ways of doing it:
1) Group by grade and teacher, then count, then flatten and group by grade.
classes = GROUP Students BY (grade,teacher);
teachers = FOREACH classes GENERATE FLATTEN(group) as (grade,teacher), COUNT(Students) as perTeacher;
grade = GROUP teachers BY grade;
result = FOREACH grade GENERATE FLATTEN(teachers), SUM(teachers.perTeacher) as perGrade;
describe result;
dump result;
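Note that the script in 1) drops the per-class bag of students that the question asked to keep. A minimal variant that carries the bag along might look like this. The input file name, the name field, and the by_grade alias are assumptions for illustration, not part of the original question:

```pig
-- Assumption: 'students.tsv' and the name field are hypothetical.
Students = LOAD 'students.tsv' AS (grade:int, teacher:chararray, name:chararray);

-- One row per (grade, teacher), keeping the bag of students and its size.
classes = GROUP Students BY (grade, teacher);
teachers = FOREACH classes GENERATE
    FLATTEN(group) AS (grade, teacher),
    Students AS students,
    COUNT(Students) AS perTeacher;

-- Regroup by grade and add up the per-teacher counts to get the grade total.
by_grade = GROUP teachers BY grade;
result = FOREACH by_grade GENERATE
    FLATTEN(teachers),
    SUM(teachers.perTeacher) AS grade_size;
```

Each row of result then holds the grade, the teacher, that teacher's students, the per-teacher count, and the total for the whole grade.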
2) Group by grade, then use the BagGroup UDF from the DataFu library to do an in-memory group by. This is faster, but it is vulnerable to heap memory exceptions because each grade's bag is grouped in memory.
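A rough sketch of the second approach, assuming the DataFu jar is registered and that BagGroup takes the bag plus a projection of the key field, as in the DataFu documentation. The jar path and the by_grade alias are hypothetical, and this has not been run against a specific DataFu version:

```pig
-- Assumption: the jar path is hypothetical; point it at your DataFu build.
REGISTER 'datafu-pig.jar';
DEFINE BagGroup datafu.pig.bags.BagGroup();

-- Single group by grade, so the grade-level count is available directly.
by_grade = GROUP Students BY grade;

-- BagGroup regroups each grade's bag by teacher in memory;
-- FLATTEN then emits one row per (grade, teacher).
result = FOREACH by_grade GENERATE
    group AS grade,
    COUNT(Students) AS grade_size,
    FLATTEN(BagGroup(Students, Students.(teacher)));
```

This should need only one GROUP, which is where the speedup comes from, but the whole bag for a grade has to fit in memory on a single task.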
Upvotes: 1