Reputation: 4601

Join Two Equal Datasets Without A Key

Using Hadoop, I would like to join two files that have an equal number of records but do not carry line numbers. For example, A.txt

a xx
b y
c z

and B.txt

1 r
2 s
3 d

After join I need to have

a xx 1 r
b y 2 s
c z 3 d

In other words, this is a perfect side-by-side concatenation. I could not figure out how to do this in Hadoop; I believe I would need an initial pass over both files to append a line number?

Answers that use Pig and/or any combination of map/reduce tricks are all fine.

Upvotes: 1

Answers (3)

cabad

Reputation: 4575

This should work in Pig:

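A = load 'A.txt';
B = load 'B.txt';

-- RANK without a BY clause prepends a sequential row number as field $0
rankedA = RANK A;
rankedB = RANK B;
-- join on the generated row numbers of both relations
joined = JOIN rankedA BY $0, rankedB BY $0;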

You can then further reorder the columns with a FOREACH statement if you want to.

Upvotes: 1

DDW

Reputation: 2015

This post gives you a hint: SO POST about special input format

Instead of emitting byte offsets, the input format could produce line numbers as keys. That way you can simply use an identity mapper (one that just emits its key/value pairs unchanged) and do the concatenation in the reducer. It may seem hard, but it only requires overriding a couple of methods in the input format.
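
To make the data flow concrete, here is a minimal local sketch in plain Python. It is not Hadoop code: `numbered_records` is a hypothetical stand-in for the custom input format, the dictionary simulates the shuffle, and the "A"/"B" tags are an assumption added so the reducer knows which side comes first.

```python
from collections import defaultdict


def numbered_records(lines, tag):
    """Stand-in for the custom input format: key is the line number."""
    for line_number, line in enumerate(lines, start=1):
        yield line_number, (tag, line)


def identity_map(key, value):
    """Identity mapper: emits each record unchanged."""
    yield key, value


def concat_reduce(values):
    """Reducer: sorts the values by tag (A before B) and joins the lines."""
    return " ".join(line for _, line in sorted(values))


def join_by_line_number(lines_a, lines_b):
    groups = defaultdict(list)  # simulates the shuffle: values grouped by key
    for tag, lines in (("A", lines_a), ("B", lines_b)):
        for key, value in numbered_records(lines, tag):
            for out_key, out_value in identity_map(key, value):
                groups[out_key].append(out_value)
    return [concat_reduce(groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    for row in join_by_line_number(["a xx", "b y", "c z"], ["1 r", "2 s", "3 d"]):
        print(row)
```

In a real job the tag would have to come from somewhere such as the input file name, since the reducer receives values for a key in no guaranteed order.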

Upvotes: 1

Binary01

Reputation: 695

Since the two files have an equal number of records, I think you can do the following to join them in a single pass (one MapReduce job):

Load the two files into two different temp tables.
Create a UDF in Hive that generates a line number (say, starting from 1), and select the fields from the temp tables to create your final tables, each of which will contain three columns, i.e. the extra column will hold the line numbers.
Join the two final tables on the line numbers.
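
The three steps above can be sketched locally as follows. This is only an illustration under stated assumptions: SQLite (through Python's `sqlite3` module) stands in for Hive so the example runs anywhere, an auto-assigned `INTEGER PRIMARY KEY` plays the role of the line-number UDF, and the table and column names are made up. Hive's own syntax and the UDF itself would look different.

```python
import sqlite3


def join_with_line_numbers(lines_a, lines_b):
    conn = sqlite3.connect(":memory:")
    # An INTEGER PRIMARY KEY column is auto-assigned 1, 2, 3, ... in
    # insertion order; here it replaces the line-number UDF (assumption).
    conn.execute("CREATE TABLE a (line_no INTEGER PRIMARY KEY, line TEXT)")
    conn.execute("CREATE TABLE b (line_no INTEGER PRIMARY KEY, line TEXT)")
    conn.executemany("INSERT INTO a (line) VALUES (?)", [(l,) for l in lines_a])
    conn.executemany("INSERT INTO b (line) VALUES (?)", [(l,) for l in lines_b])
    # Join the two numbered tables on the generated line numbers.
    rows = conn.execute(
        "SELECT a.line || ' ' || b.line "
        "FROM a JOIN b ON a.line_no = b.line_no "
        "ORDER BY a.line_no"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    for row in join_with_line_numbers(["a xx", "b y", "c z"], ["1 r", "2 s", "3 d"]):
        print(row)
```

The sketch keeps each input line as a single text column for brevity; splitting it into separate fields would not change the join.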

Hope this may help your cause.

Upvotes: 0
