Reputation: 4601
Using Hadoop, I would like to join two files that have an equal number of records but do not carry a line number. For example, A.txt:
a xx
b y
c z
and B.txt
1 r
2 s
3 d
After the join I need to have:
a xx 1 r
b y 2 s
c z 3 d
In other words, this is a perfect side-by-side concatenation. I could not figure out how to do this in Hadoop; I believe I would need an initial pass over both files to append a line number?
Answers that use Pig and/or any combination of map/reduce tricks are all fine.
Upvotes: 1
Views: 128
Reputation: 4575
This should work in Pig:
A = load 'A.txt';
B = load 'B.txt';
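-- RANK prepends a sequential row number as the first column ($0)
rankedA = RANK A;
rankedB = RANK B;
-- join on the row numbers, not on B's own first field
joined = JOIN rankedA BY $0, rankedB BY $0;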
You can then reorder or drop columns with a FOREACH statement if you want to.
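To make the data flow concrete, here is a minimal local sketch in plain Python (not Pig) of what a rank-then-join does: each relation gets a 1-based row number prepended, the two are joined on that number, and the rank columns are dropped at the end, which is the job the FOREACH would do. The names `rank` and `join_on_rank` are hypothetical, and the sketch assumes the row numbers follow the original line order of each file.

```python
def rank(rows):
    """Prepend a 1-based row number to each row, like a rank without a sort key."""
    return [(i, *row) for i, row in enumerate(rows, start=1)]


def join_on_rank(left, right):
    """Inner-join two ranked relations on their first column, then drop the ranks."""
    right_by_rank = {row[0]: row[1:] for row in right}
    return [
        row[1:] + right_by_rank[row[0]]
        for row in left
        if row[0] in right_by_rank
    ]


# Sample data from the question, already split into fields.
A = [("a", "xx"), ("b", "y"), ("c", "z")]
B = [("1", "r"), ("2", "s"), ("3", "d")]

joined = join_on_rank(rank(A), rank(B))
for row in joined:
    print(" ".join(row))
```

Both sides have to be ranked for this to work in general; B's first field only happens to look like a row number in the sample data.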
Upvotes: 1
Reputation: 2015
This post gives you a hint: SO POST about special input format
Instead of giving byte offsets, the input format could produce line numbers as the key. That way you can simply use an identity mapper (one that just emits the key/value pairs it receives) and do the concatenation in the reducer. It may seem hard, but it's just a matter of overriding a couple of methods in the input format and you're done.
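A minimal local sketch of that data flow in plain Python, with no Hadoop API involved: `read_with_line_numbers` stands in for the custom input format, the mapper is the identity, and the reducer concatenates the values that share a line number. The source tag (0 for A.txt, 1 for B.txt) is an assumption added here; without something like it the reducer has no guaranteed order in which to lay out the two halves. All function names are hypothetical.

```python
from collections import defaultdict


def read_with_line_numbers(lines, tag):
    """Stand-in for the custom input format: key is the line number, not a byte offset."""
    for line_no, line in enumerate(lines):
        yield line_no, (tag, line)


def identity_map(key, value):
    """Identity mapper: pass each record through unchanged."""
    yield key, value


def concat_reduce(key, values):
    """Order the values by source tag and join them into one output line."""
    return " ".join(line for _, line in sorted(values))


def run_job(a_lines, b_lines):
    # Shuffle phase: group mapper output by key, as the framework would.
    groups = defaultdict(list)
    sources = (
        read_with_line_numbers(a_lines, 0),  # tag 0: lines from A.txt
        read_with_line_numbers(b_lines, 1),  # tag 1: lines from B.txt
    )
    for source in sources:
        for key, value in source:
            for out_key, out_value in identity_map(key, value):
                groups[out_key].append(out_value)
    # Reduce phase: one output line per line number, in key order.
    return [concat_reduce(key, groups[key]) for key in sorted(groups)]


result = run_job(["a xx", "b y", "c z"], ["1 r", "2 s", "3 d"])
print("\n".join(result))
```

In a real job the line number has to be global across input splits, which a reader that only sees its own split cannot know by itself; keeping each file in a single split is one simple way around that, at the cost of parallelism.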
Upvotes: 1
Reputation: 695
Since the two files have an equal number of records, I think you can do the following to join them in a single pass (one MapReduce job):
Hope this may help your cause.
Upvotes: 0