Uno

Reputation: 543

Join USING 'merge' in PIG

I am a Hadoop/PIG beginner.

Could anyone please tell me the difference between

grunt> A = join A by $1, B by $1 using 'merge';     

And
grunt> A = join A by $1, B by $1;

I have two files, 1.txt and 2.txt, which contain the following data:
1.txt
A 1
B 3
C 5
D 7

2.txt
AA 1
BB 2
CC 4
DD 6

And I want the output merged together like this
A 1
AA 1
BB 2
B 3
CC 4
C 5
DD 6
D 7

Will "using 'merge'" give me the desired output?

I tried it, but it does not.

Can you let me know what I am missing here?

Upvotes: 1

Views: 5737

Answers (1)

Chris White

Reputation: 30089

Sounds like you are getting an inner join (datasets joined by a common key) rather than an outer join (which is what it looks like you are after from your desired output).

Use the keyword FULL to signify you want a full outer join:

grunt> A = join A by $1 FULL, B by $1 using 'merge';  

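This may, however, yield unexpected results if both datasets contain a record with the same join key, $1 (see the example for inner join): the two records are combined into a single output row rather than kept as two separate rows. You may also need to amend the output to drop the columns that are empty for records present in only one of the two datasets.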
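To see what that looks like on your sample data, here is a minimal sketch using a plain (non-merge) full outer join. It assumes both files are space-delimited, and the field names name and id are made up for the example; adjust the delimiter and schema to match your real files.

```pig
-- Assumption: 1.txt and 2.txt are space-delimited.
-- 'name' and 'id' are illustrative field names, not taken from the question.
A = LOAD '1.txt' USING PigStorage(' ') AS (name:chararray, id:int);
B = LOAD '2.txt' USING PigStorage(' ') AS (name:chararray, id:int);

-- Full outer join on the second column.
-- Rows without a match get nulls for the other side's columns.
J = JOIN A BY id FULL OUTER, B BY id;

-- Each output row has four columns: A's name and id, then B's name and id.
-- Matching keys share one row, e.g. (A,1,AA,1).
-- Unmatched rows look like (,,BB,2) or (B,3,,).
DUMP J;
```

So even with FULL you get one wide row per key instead of the two-column interleaved list you are after, which is why the join output would still need reshaping.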

Alternatively, if you just want to append one dataset to the other and then sort the result, use the UNION and ORDER BY operators:

grunt> U = UNION A, B;
grunt> OrderedU = ORDER U BY $1;

See the Pig Latin reference documentation for more information about each operator.
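Put together as a script, the UNION approach might look like the sketch below. As before, the space delimiter and the field names name and id are assumptions for illustration. Sorting on name as a secondary key is my addition, so that rows sharing an id come out in a predictable order.

```pig
-- Assumption: 1.txt and 2.txt are space-delimited.
-- 'name' and 'id' are illustrative field names, not taken from the question.
A = LOAD '1.txt' USING PigStorage(' ') AS (name:chararray, id:int);
B = LOAD '2.txt' USING PigStorage(' ') AS (name:chararray, id:int);

-- Append B to A; both relations have the same two-column schema.
U = UNION A, B;

-- Sort by the numeric column, then by name to break ties between equal ids.
OrderedU = ORDER U BY id, name;

DUMP OrderedU;
```

With the sample files from the question, this should list the rows in the interleaved order you described, with A 1 ahead of AA 1.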

Upvotes: 3
