Hari

Reputation: 3

Emit a group only when a condition is met

I have a requirement to emit all records belonging to a group only when a condition is met. Below is a sample data set with the alias "SAMPLE_DATA".

Col-1   |  Col-2 | Col-3
-------------------------
2       | 4      | 1
2       | 5      | 2
3       | 3      | 1
3       | 2      | 2
4       | 5      | 1
4       | 6      | 2

SAMPLE_DATA_GRP = GROUP SAMPLE_DATA BY Col-1;
RESULT = FOREACH SAMPLE_DATA_GRP {
    max_value = MAX(SAMPLE_DATA.Col-2);
    IF(max_value >= 5)
          GENERATE ALL RECORDS IN THAT GROUP;
}

RESULT should be:

Col-1   |  Col-2 | Col-3
-------------------------
2       | 4      | 1
2       | 5      | 2
----      ----     ---
4       | 5      | 1
4       | 6      | 2

Two groups are emitted. The first group is emitted because the max of its Col-2 values (4, 5) is 5, which meets the condition >= 5. The same holds for the last group (6 >= 5). The group with Col-1 = 3 is dropped because its max is 3.

Since I will be running this on a large dataset, operations such as DISTINCT and JOIN would be overkill. For this reason I have come up with the pseudocode above, which uses a single grouping.

Hope I have provided enough information. Thanks in advance.


Upvotes: 0

Views: 67

Answers (1)

Surender Raja

Reputation: 3599

Please try the code below.

This solution is a little lengthy, but it will work.

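-- Load the input as three integer columns.
numbers = LOAD '/home/user/inputfiles/c1.txt' USING PigStorage(',') AS (c1:int, c2:int, c3:int);

num_grp = GROUP numbers BY c1;

-- For each group, flag whether its maximum c2 meets the threshold.
num_each = FOREACH num_grp {
    max_each = MAX(numbers.c2);
    GENERATE FLATTEN(group) AS temp_c1, (max_each >= 5 ? 1 : 0) AS indicator;
};

-- Keep only the keys of the qualifying groups.
num_each_filtered = FILTER num_each BY indicator == 1;

-- Join back to the original rows to recover every record of those groups.
num_joined = JOIN numbers BY c1, num_each_filtered BY temp_c1;

num_output = FOREACH num_joined GENERATE c1, c2, c3;

DUMP num_output;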

Output:

Col-1   |  Col-2 | Col-3
-------------------------
2       | 4      | 1
2       | 5      | 2
----      ----     ---
4       | 5      | 1
4       | 6      | 2
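
Since the question asks for a single grouping without a join, one possible alternative is to filter the grouped relation directly on the aggregate and then flatten the bags that survive. This is only a minimal sketch: it assumes the same input file and schema as the code above, the aliases `num_kept` and `num_flat` are made up for illustration, and it has not been run against the real data set.

```pig
-- Assumption: same input file and schema as in the answer above.
numbers = LOAD '/home/user/inputfiles/c1.txt' USING PigStorage(',') AS (c1:int, c2:int, c3:int);

-- One grouping: each row is (group, {bag of original records}).
num_grp = GROUP numbers BY c1;

-- Keep only the groups whose maximum c2 meets the threshold.
num_kept = FILTER num_grp BY MAX(numbers.c2) >= 5;

-- Expand each surviving bag back into its original (c1, c2, c3) records.
num_flat = FOREACH num_kept GENERATE FLATTEN(numbers);

DUMP num_flat;
```

For the sample data this should keep the records of groups 2 and 4 and drop group 3, without a second pass over `numbers`. The trade-off is that each group's bag is carried through the filter, so very large groups may be costly.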

Upvotes: 0
