Nicomoto

Reputation: 483

Joining on rows (combining datasets from two different sources) in Pig

Suppose I have two datasets

ID Name
1 Dog
2 Cat

and another data set

ID Name Age
3 Man 23

I load both into Pig and drop the Age field from the second. How do I combine the two so that I get

Id Name
1 Dog
2 Cat
3 Man

The rows can be distributed across mappers and added to any mapper in any order. I just want them together so that I can perform a sort-like operation (removing duplicates and fetching the most recent timestamp) in the next map-reduce phase.

Upvotes: 0

Views: 183

Answers (3)

harish kumar

Reputation: 45

set1 = load 'dataset1' as (ID,Name);
set2 = load 'dataset2' as (ID,Name,Age);
set3 = foreach set2 generate ID,Name;

Result = UNION set1,set3;

Note that the output order may change. When you DUMP the result, you may get

Id Name
3 Man
1 Dog
2 Cat

UNION in Pig does not preserve the order of its inputs. For data this small you may not see any difference, but when you union more than two files the order is likely to change.
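
If a deterministic order is needed after the union, one option is an explicit ORDER BY on the combined relation. A minimal sketch, assuming tab-delimited input at the hypothetical paths 'dataset1' and 'dataset2'; the field types are also assumptions, not taken from the question:

```pig
-- Hypothetical paths and assumed field types; adjust to the real data.
set1 = LOAD 'dataset1' AS (ID:int, Name:chararray);
set2 = LOAD 'dataset2' AS (ID:int, Name:chararray, Age:int);

-- Drop Age so both relations have the same two fields.
set3 = FOREACH set2 GENERATE ID, Name;

-- UNION gives no ordering guarantee, so sort explicitly afterwards.
combined = UNION set1, set3;
sorted = ORDER combined BY ID;

DUMP sorted;
```

The ORDER BY adds another map-reduce job, so it is probably only worth it when a later step actually depends on the order.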

Upvotes: 0

Rengasamy

Reputation: 1043

Try this,

set1 = load 'dataset1' as (ID,Name);
set2 = load 'dataset2' as (ID,Name,Age);
set3 = foreach set2 generate ID,Name;

Result = UNION set1,set3;

Upvotes: 1

psmith

Reputation: 1813

Use UNION: http://pig.apache.org/docs/r0.12.1/basic.html#union

As you can see in the examples, you don't need to remove the Age field from set2. But if you want to, just use GENERATE:

set3 = FOREACH set2 GENERATE ID, Name;

set4 = UNION set1, set3;
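
To keep the Age field instead of dropping it, one option is UNION ONSCHEMA, which merges relations by field name and fills fields missing from one side with null. A minimal sketch, assuming tab-delimited input at the hypothetical paths 'dataset1' and 'dataset2' and assumed field types:

```pig
-- Hypothetical paths and assumed field types; adjust to the real data.
set1 = LOAD 'dataset1' AS (ID:int, Name:chararray);
set2 = LOAD 'dataset2' AS (ID:int, Name:chararray, Age:int);

-- ONSCHEMA needs named schemas on both inputs; rows from set1 get a null Age.
merged = UNION ONSCHEMA set1, set2;

DESCRIBE merged;
DUMP merged;
```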

Regards

Upvotes: 1
