Richipal
Richipal

Reputation: 731

Pig split and join

I have a requirement to propagate field values from one row to another given type of record for example my raw input is

1,firefox,p  
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p

the desired result

1,firefox,p  
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p

I tried

A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined =  JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE  B::id,  B::type, B::browser, C::id, C::type;
dump joinedFields;

the result I got was

(,,,1,p  )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)

Any help is appreciated, Thanks.

Upvotes: 1

Views: 2634

Answers (1)

alexeipab
alexeipab

Reputation: 3619

PIG is not exactly SQL, it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, FILTER nested in the FOREACH and FLATTEN.

inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
    P = FILTER inpt BY type == 'p'; -- leave the record that contain p for the id
    PL = LIMIT P 1; -- make sure there is just one
    GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
}; 

Upvotes: 2

Related Questions