a123
a123

Reputation: 71

Get value for unique record using Pig

Below is the input data set.

col1,col2,col3,col4,col5

key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10

Based on col2,col3,col4 will give unique record, I need to get any one value from col1 for the unique record, and populate as new field say col6. The expected output below

col1,col2,col3,col4,col5,col6

key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7

Below is the script, I am getting error.

A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function

Upvotes: 0

Views: 61

Answers (1)

Murali Rao
Murali Rao

Reputation: 2287

Looks like a good use case to use Nested Foreach

Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach

Input :

key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10

PigScript

A = load 'input.csv' using PigStorage(',')  AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH(GROUP A BY (col2,col3,col4)) {
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(A) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray), FLATTEN(latest.col1) AS col6:chararray;
};

DUMP B;

Output :

(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)

Upvotes: 1

Related Questions