Reputation: 305
Hi, I'm relatively new to programming in Pig and have run into an issue that I am having a hard time resolving.
I have two data sets:
A: (accountId:chararray, title:chararray, genre:chararray)
("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")
B: (accountId:chararray, title:chararray, genre:chararray)
("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")
The result I want should be
(accountId:chararray, {(),(),...})
(A123, {("A123", "Harry Potter", "Action/Adventure"),
("A123", "Sherlock Holmes", "Mystery"),
("A123", "Divergent", "Action"),
("A123", "Downton Abbey", "Drama")
})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama"),
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Currently I am doing:
ANS = JOIN A BY accountId, B BY accountId;
but the result looks like
SCHEMA: (accountId:chararray, {(accountId:chararray, title:chararray, genre:chararray), ...})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama")}
"B456", {
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Any idea what I may be doing incorrectly?
Upvotes: 0
Views: 159
Reputation: 561
Try this:
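-- IMPORTANT: register datafu.jar (adjust the path to wherever the jar lives)
register datafu.jar;
define BagConcat datafu.pig.bags.BagConcat();
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
-- cogroup yields one row per id: (group, bag of A rows, bag of B rows)
C = cogroup A by id, B by id;
-- keep the key and merge the two bags into a single bag
D = foreach C generate group, BagConcat(A, B);
dump D;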
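JOIN simply pairs up matching rows from your two relations as they are; it does not collect them into bags. What you want is to group the rows of each relation by the key and bring the groups that share a key together.
Both actions are performed by COGROUP. The best explanation I have read for it is here: http://joshualande.com/cogroup-in-pig/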
After the COGROUP, your relation contains the group key (the ID) and two bags (one from A, one from B), each holding the rows from the original relation. The way to 'unite' them into one bag is the BagConcat function from datafu.jar. DataFu is a library of Pig UDFs that's full of goodies. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html
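If pulling in DataFu is not an option, a different route that should give the same shape of result is to stack the two relations with UNION and then GROUP by the key. This is only a sketch using built-in operators, not the COGROUP approach above. It assumes both inputs share the same three-column schema and reuses the file names 'A' and 'B' and the comma delimiter from the script above:

```pig
-- Sketch: one bag per accountId using only built-in operators (no DataFu).
-- Assumption: 'A' and 'B' are comma-delimited files with identical columns.
A = LOAD 'A' USING PigStorage(',') AS (accountId:chararray, title:chararray, genre:chararray);
B = LOAD 'B' USING PigStorage(',') AS (accountId:chararray, title:chararray, genre:chararray);

-- Stack the rows of both relations; the schemas match, so a plain UNION works.
U = UNION A, B;

-- One output row per accountId: (group, {all rows with that accountId}).
G = GROUP U BY accountId;

DUMP G;
```

The order of tuples inside each bag is not guaranteed. COGROUP plus BagConcat is the better fit when you also need the two source bags kept apart at some point, since UNION discards which relation a row came from.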
Upvotes: 1