Reputation: 305
I have 3 sets of data, all in the format (acctid:chararray, rule:chararray, value:chararray).
Set 1 file:
123;R1;r1 version set 1 123
123;R2;r2 version set 1 123
123;R3;r3 version set 1 123
124;R1;r1 version set 1 124
124;R2;r2 version set 1 124
124;R3;r3 version set 1 124
Set 2 file: // changes R2
123;R2;r2 version set 2 123
124;R2;r2 version set 2 124
Set 3 file:
123;R4;r4 version set 3 123
124;R4;r4 version set 3 124
I need to merge the data such that:
the R2 values in the first data set are replaced by those from the second set
the R3 values are removed
the R4 values from the third set are added
Then I can do a group by account id and get:
final:
123;R1;r1 version set 1 123
123;R2;r2 version set 2 123
123;R4;r4 version set 3 123
124;R1;r1 version set 1 124
124;R2;r2 version set 2 124
124;R4;r4 version set 3 124
I tried various joins and merges, but I can't tell whether this is even possible. Thanks.
Upvotes: 0
Views: 83
Reputation: 605
Try this; it gives the desired output:
-- Load the three ';'-delimited files with the same schema.
set_1 = LOAD '/home/abhis/set_1' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_2 = LOAD '/home/abhis/set_2' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_3 = LOAD '/home/abhis/set_3' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
-- Keep only the R1 rows of set 1; this drops the old R2 rows and the R3 rows.
DATA_SET1 = FILTER set_1 BY (rule matches '.*R1.*');
-- Add the new R2 rows (set 2) and the R4 rows (set 3).
DATA_SET2 = UNION DATA_SET1, set_2, set_3;
-- Sort by account id, then by rule.
DATA_SET3 = ORDER DATA_SET2 BY acctid, rule;
DUMP DATA_SET3;
Output:
(123,R1,r1 version set 1 123)
(123,R2,r2 version set 2 123)
(123,R4,r4 version set 3 123)
(124,R1,r1 version set 1 124)
(124,R2,r2 version set 2 124)
(124,R4,r4 version set 3 124)
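The filter above works because R1 is the only rule kept from set 1, and the pattern '.*R1.*' would also match any rule name that merely contains "R1". If set 1 can hold other rules that should survive, a variant that excludes only the replaced and removed rules may be a better fit. This is a minimal sketch under that assumption; the aliases kept, merged and by_acct are made up for illustration, and the final GROUP is the group by account id mentioned in the question.

```pig
-- Same files and schema as in the script above.
set_1 = LOAD '/home/abhis/set_1' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_2 = LOAD '/home/abhis/set_2' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_3 = LOAD '/home/abhis/set_3' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);

-- Drop from set 1 the rule that set 2 replaces (R2) and the rule to remove (R3);
-- every other rule in set 1 is kept (assumption: exact rule names, no patterns).
kept = FILTER set_1 BY (rule != 'R2') AND (rule != 'R3');

-- Add the replacement R2 rows and the new R4 rows.
merged = UNION kept, set_2, set_3;

-- One record per account id, holding a bag of that account's (acctid, rule, value) tuples.
by_acct = GROUP merged BY acctid;
DUMP by_acct;
```

The order of tuples inside each bag is not guaranteed, so sort inside a nested FOREACH if the rule order matters.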
Upvotes: 1