Apache Pig : Select from multiple table on a common column without join

Question

I am new to pig and hadoop. I need to select and group rows from multiple tables on a common column (but it is not a join).

For example if table 1 is:

adv1,app1,adg1,camp1  
adv2,app3,adg2,camp2

and table 2 is:

adv1,app2,adg2,camp1  
adv3,app1,adg3,camp3  
adv1,app1,adg4,camp2

then I want something like this:

adv1,app1,adg1,camp1  
adv1,app2,adg2,camp1  
adv1,app1,adg4,camp2

cyang · Accepted Answer

I think what you want to do is load both tables and then perform a filter operation.

If you have a file called table1 that has contents

adv1,app1,adg1,camp1
adv2,app3,adg2,camp2

and file table2 that has contents

adv1,app2,adg2,camp1
adv3,app1,adg3,camp3
adv1,app1,adg4,camp2

then you can do the following:

T = load '{/path/to/table1,/path/to/table2}' using PigStorage(',')
        as (adv:chararray, app:chararray, adg:chararray, camp:chararray);
result = filter T by adv == 'adv1';

> dump result
(adv1,app2,adg2,camp1)
(adv1,app1,adg4,camp2)
(adv1,app1,adg1,camp1)

Alternatively, if you have two relations that you aren't loading from a file, you can use union to combine them, and then filter.

> dump T1
(adv1,app1,adg1,camp1)
(adv2,app3,adg2,camp2)
> dump T2
(adv1,app2,adg2,camp1)
(adv3,app1,adg3,camp3)
(adv1,app1,adg4,camp2)

T = union T1, T2;
result = filter T by adv == 'adv1';

Furthermore, if you're trying to group based on all possible keys and not just adv1, then you can do a group instead of a filter for the last line.

result = group T by adv;

> dump result
(adv1,{(adv1,app1,adg1,camp1),(adv1,app2,adg2,camp1),(adv1,app1,adg4,camp2)})
(adv2,{(adv2,app3,adg2,camp2)})
(adv3,{(adv3,app1,adg3,camp3)})

Apache Pig : Select from multiple table on a common column without join

Answers (1)

Related Questions