Reputation: 483
Suppose I have two datasets. The first is:
ID Name
1 Dog
2 Cat
and the second is:
ID Name Age
3 Man 23
I load both into Pig and drop the Age field from the second. How do I then combine the two so that I get:
ID Name
1 Dog
2 Cat
3 Man
The records can be distributed across mappers and arrive at any mapper in any order. I just want them together so that I can perform a sort-like operation (removing duplicates and fetching the most recent timestamp) in the next map-reduce phase.
Upvotes: 0
Views: 183
Reputation: 45
set1 = load 'dataset1' as (ID,Name);
set2 = load 'dataset2' as (ID,Name,Age);
set3 = foreach set2 generate ID,Name;
Result = UNION set1,set3;
Note that the output order may change. When you dump the result, you may get:
ID Name
3 Man
1 Dog
2 Cat
UNION in Pig does not preserve the order of tuples. For the small data above you may not see any difference, but when you union more than two files the order changes.
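If the order matters for the next step, one option is to sort explicitly after the UNION. Below is a minimal sketch, not a definitive solution: the types (int, chararray) are my assumptions, since the question does not declare any, and I added DISTINCT only because the question mentions removing duplicates.

```pig
-- Sketch only: field types are assumed, paths are taken from the answer above
set1 = LOAD 'dataset1' AS (ID:int, Name:chararray);
set2 = LOAD 'dataset2' AS (ID:int, Name:chararray, Age:int);

-- Drop Age so both relations have the same fields
set3 = FOREACH set2 GENERATE ID, Name;

-- UNION does not preserve tuple order
Result = UNION set1, set3;

-- Remove exact duplicate tuples, then sort by ID for a deterministic order
Deduped = DISTINCT Result;
Sorted = ORDER Deduped BY ID;
DUMP Sorted;
```

Declaring ID as int makes ORDER compare numerically; without a declared type the field is a bytearray and would not sort as a number.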
Upvotes: 0
Reputation: 1043
Try this,
set1 = load 'dataset1' as (ID,Name);
set2 = load 'dataset2' as (ID,Name,Age);
set3 = foreach set2 generate ID,Name;
Result = UNION set1,set3;
Upvotes: 1
Reputation: 1813
Use UNION: http://pig.apache.org/docs/r0.12.1/basic.html#union
As you can see in the examples there, you don't need to remove the Age field from set2, because UNION does not require the relations to have the same schema.
But if you want to, just use GENERATE:
set3 = foreach set2 GENERATE ID, Name;
set4 = UNION set1, set3;
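If you keep Age instead, UNION ONSCHEMA is probably the cleaner variant, since it matches fields by name and fills the missing ones with null. This is a rough sketch assuming both relations are loaded with named fields as in the other answers; the alias merged is just a name I made up for illustration.

```pig
-- Sketch only: assumes both loads declare field names (required by ONSCHEMA)
set1 = LOAD 'dataset1' AS (ID, Name);
set2 = LOAD 'dataset2' AS (ID, Name, Age);

-- Fields are matched by name; Age is null for tuples coming from set1
merged = UNION ONSCHEMA set1, set2;
DESCRIBE merged;
```

A plain UNION of set1 and set2 would also run, but the result would mix two-field and three-field tuples.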
Regards
Upvotes: 1