VSEWHGHP

Reputation: 305

How to join 2 data sets with the same schema in Pig

Hi, I'm relatively new to programming in Pig and have run into an issue that I am having a hard time resolving.

I have 2 data sets:

A: (accountId:chararray, title:chararray, genre:chararray)

("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")

B: (accountId:chararray, title:chararray, genre:chararray)

("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")

The result I want should be

(accountId:chararray, {(),(),...})

(A123, {("A123", "Harry Potter", "Action/Adventure"),
        ("A123", "Sherlock Holmes", "Mystery"),
        ("A123", "Divergent", "Action"),
        ("A123", "Downton Abbey", "Drama")
        })

(B456, {("B456", "James Bond", "Action"),
        ("B456", "Hamlet", "Drama"),
        ("B456", "Percy Jackson", "Action/Adventure"),
        ("B456", "Elementary", "Mystery")
        })

Currently I am doing:

ANS = JOIN A BY accountId, B BY accountId;

but the result looks like

SCHEMA: (accountId:chararray, {(accountId:chararray, title:chararray, genre:chararray), ...})

(B456, {("B456", "James Bond", "Action"),
        ("B456", "Hamlet", "Drama")}
       "B456", {
        ("B456", "Percy Jackson", "Action/Adventure"),
        ("B456", "Elementary", "Mystery")
        })

Any idea what I may be doing incorrectly?

Upvotes: 0

Views: 159

Answers (1)

Ran Locar

Reputation: 561

Try this:

-- IMPORTANT: register datafu.jar before running this script
define BagConcat datafu.pig.bags.BagConcat();
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
-- C has one row per id: (group, {rows from A}, {rows from B})
C = cogroup A by id, B by id;
-- keep the key and merge the two bags into a single bag
D = foreach C generate group as id, BagConcat(A, B);
dump D;

JOIN simply pairs up matching rows from your two relations as they are; it does not collect them into bags. You want to accomplish two things:

  • GROUP all rows belonging to the same account in each relation
  • JOIN the two 'grouped' relations on the account ID

COGROUP performs both actions. Note that it behaves like an outer join: an ID that exists in only one relation is still kept, with an empty bag for the other relation. The best explanation I have read for it is here: http://joshualande.com/cogroup-in-pig/

Your relation will now contain the group key (ID) and two bags (one from A, one from B), each containing the rows from the original relation. The way to 'unite' them into one bag is the BagConcat function from datafu.jar. datafu is a library of Pig UDFs that is full of goodies. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html
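To make the data flow concrete, here is a rough emulation of the COGROUP and bag-concatenation steps in plain Python, run on the sample rows from the question. This is only a sketch of the semantics, not Pig or datafu code: the `cogroup` helper is a hypothetical name written for this illustration, and Python lists stand in for Pig bags.

```python
from collections import defaultdict


def cogroup(a, b, key=lambda row: row[0]):
    """Hypothetical helper that mimics COGROUP on two lists of tuples.

    Returns one (key, bag_from_a, bag_from_b) tuple per distinct key.
    A key found in only one input still appears, with an empty list
    for the other input.
    """
    groups = defaultdict(lambda: ([], []))
    for row in a:
        groups[key(row)][0].append(row)
    for row in b:
        groups[key(row)][1].append(row)
    # Sorted by key only to make the output deterministic.
    return [(k, bag_a, bag_b) for k, (bag_a, bag_b) in sorted(groups.items())]


A = [
    ("A123", "Harry Potter", "Action/Adventure"),
    ("A123", "Sherlock Holmes", "Mystery"),
    ("B456", "James Bond", "Action"),
    ("B456", "Hamlet", "Drama"),
]
B = [
    ("B456", "Percy Jackson", "Action/Adventure"),
    ("B456", "Elementary", "Mystery"),
    ("A123", "Divergent", "Action"),
    ("A123", "Downton Abbey", "Drama"),
]

# Step 1: one row per account with two separate bags (what COGROUP yields).
C = cogroup(A, B)

# Step 2: keep the key and merge the two bags into one (the BagConcat step).
D = [(k, bag_a + bag_b) for k, bag_a, bag_b in C]

for row in D:
    print(row)
```

Each row of D has the shape asked for in the question: the account ID followed by a single collection of all its rows from both inputs. Pig bags are unordered, so the order of tuples inside each list here is just Python's list order.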

Upvotes: 1
