Reputation: 13985
I am new to pig and hadoop. I need to select and group rows from multiple tables on a common column (but it is not a join).
For example if table 1 is:
adv1,app1,adg1,camp1
adv2,app3,adg2,camp2
and table 2 is:
adv1,app2,adg2,camp1
adv3,app1,adg3,camp3
adv1,app1,adg4,camp2
then I want something like this:
adv1,app1,adg1,camp1
adv1,app2,adg2,camp1
adv1,app1,adg4,camp2
Upvotes: 0
Views: 1654
Reputation: 5694
I think what you want to do is load both tables and then perform a filter operation.
If you have a file called table1
that has contents
adv1,app1,adg1,camp1
adv2,app3,adg2,camp2
and file table2
that has contents
adv1,app2,adg2,camp1
adv3,app1,adg3,camp3
adv1,app1,adg4,camp2
then you can do the following:
T = load '{/path/to/table1,/path/to/table2}' using PigStorage(',')
as (adv:chararray, app:chararray, adg:chararray, camp:chararray);
result = filter T by adv == 'adv1';
> dump result
(adv1,app2,adg2,camp1)
(adv1,app1,adg4,camp2)
(adv1,app1,adg1,camp1)
Alternatively, if you have two relations that you aren't loading from a file, you can use union to combine them, and then filter.
> dump T1
(adv1,app1,adg1,camp1)
(adv2,app3,adg2,camp2)
> dump T2
(adv1,app2,adg2,camp1)
(adv3,app1,adg3,camp3)
(adv1,app1,adg4,camp2)
T = union T1, T2;
result = filter T by adv == 'adv1';
Furthermore, if you're trying to group based on all possible keys and not just adv1, then you can do a group instead of a filter for the last line.
result = group T by adv;
> dump result
(adv1,{(adv1,app1,adg1,camp1),(adv1,app2,adg2,camp1),(adv1,app1,adg4,camp2)})
(adv2,{(adv2,app3,adg2,camp2)})
(adv3,{(adv3,app1,adg3,camp3)})
Upvotes: 1