Reputation: 3
I have a requirement to emit all the records that belong to a group, but only when a condition is met. Below is a sample data set with the alias "SAMPLE_DATA".
Col-1 | Col-2 | Col-3
-------------------------
2 | 4 | 1
2 | 5 | 2
3 | 3 | 1
3 | 2 | 2
4 | 5 | 1
4 | 6 | 2
SAMPLE_DATA_GRP = GROUP SAMPLE_DATA BY Col-1;
RESULT = FOREACH SAMPLE_DATA_GRP {
max_value = MAX(SAMPLE_DATA.Col-2);
IF(max_value >= 5)
GENERATE ALL RECORDS IN THAT GROUP;
}
RESULT should be:
Col-1 | Col-2 | Col-3
-------------------------
2 | 4 | 1
2 | 5 | 2
---- ---- ---
4 | 5 | 1
4 | 6 | 2
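Two groups are emitted. The first group (Col-1 = 2) is emitted because the maximum of its Col-2 values (4, 5) is 5, which meets the condition >= 5. The same holds for the second group (Col-1 = 4), where the maximum is 6 (6 >= 5). The group with Col-1 = 3 is dropped because its maximum Col-2 value is 3.
As I would be performing this operation on a big data set, operations like DISTINCT and JOIN would be overkill. For this reason I have come up with the pseudo code above, which uses a single grouping to perform the operation.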
Hope I have provided enough information. Thanks in advance.
I would be performing this operation on a huge data set. Operations like DISTINCT and JOIN would be overkill on the system, which is why I came up with this grouping approach.
Upvotes: 0
Views: 67
Reputation: 3599
Please try the code below.
This solution is a little lengthy, but it will work.
-- Load the comma-separated input as three int columns
numbers = LOAD '/home/user/inputfiles/c1.txt' USING PigStorage(',') AS (c1:int, c2:int, c3:int);
num_grp = GROUP numbers BY c1;
-- For each group, flag whether its maximum c2 meets the condition
num_each = FOREACH num_grp
{
max_each = MAX(numbers.c2);
GENERATE FLATTEN(group) AS temp_c1, (max_each >= 5 ? 1 : 0) AS indicator;
};
num_each_filtered = FILTER num_each BY indicator == 1;
-- Join back to the original rows so only records of qualifying groups remain
num_joined = JOIN numbers BY c1, num_each_filtered BY temp_c1;
num_output = FOREACH num_joined GENERATE c1, c2, c3;
DUMP num_output;
Output:
Col-1 | Col-2 | Col-3
-------------------------
2 | 4 | 1
2 | 5 | 2
---- ---- ---
4 | 5 | 1
4 | 6 | 2
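Since the question asks for a single grouping without a join, here is a shorter sketch of the same idea. It assumes the same input path and schema as the script above, and that FILTER can apply MAX directly to the grouped bag; the aliases num_kept and num_output are only illustrative names. I have not measured it on a large data set, so treat it as something to try rather than a guaranteed improvement.

```pig
-- Same hypothetical input file and schema as in the script above
numbers = LOAD '/home/user/inputfiles/c1.txt' USING PigStorage(',') AS (c1:int, c2:int, c3:int);
num_grp = GROUP numbers BY c1;
-- Keep only the groups whose maximum c2 meets the condition
num_kept = FILTER num_grp BY MAX(numbers.c2) >= 5;
-- Expand the bag of each surviving group back into individual records
num_output = FOREACH num_kept GENERATE FLATTEN(numbers);
DUMP num_output;
```

For the sample data this should emit the records of groups 2 and 4 only. The fields of num_output carry the bag's schema (c1, c2, c3), and the order of the emitted rows is not guaranteed.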
Upvotes: 0