Hadoop: How can I give every value a global unique ID number as key in Mapper?

Question

Here is what I want to do. Now I have some text files like this:


xxx.example.com
xxx
abcdef



yyy.example.com
yyy
abcdef


...

And I want to read the file split in the mapper and convert it to key-value pairs, where each value is the content inside one tag.



My problem is about the key. I can use the URLs as keys because they are globally unique. However, due to the context of my job, I want to generate a globally unique number as the key for each key-value pair. I know this somewhat goes against the horizontal scalability of Hadoop. But is there any solution to this?

Roman Nikitchenko · Accepted Answer

If you're going to process such files with MapReduce, I'd take the following strategy:

Use the general text input format (TextInputFormat), which reads line by line. As a result, each file goes to its own mapper (as long as the file fits in a single input split).
In the mapper, build a loop that pulls the next lines itself via context.nextKeyValue() (for example by overriding run()), instead of relying on map() being called once per line.
Feed the lines to some kind of parser (maybe reading 6 non-empty lines is enough, maybe you will use something like libxml); either way you end up with a number of objects.
If you intend to pass the objects you read to a reducer, you need to wrap them in something that implements the Writable interface.
To form keys I'd use the UUID implementation java.util.UUID. Something like:

UUID key = UUID.randomUUID();

That's enough unless you are generating billions of records per second and your job runs for 100 years. :-)
Just note that the UUID should probably be wrapped in a byte-oriented Writable such as ImmutableBytesWritable (an HBase class), which is useful for such things.
That's all; then call context.write(key, object). Note that the key comes first.
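
The parsing and UUID steps above can be sketched with plain JDK classes, without Hadoop on the classpath. This is only a sketch under assumptions: the class UuidKeySketch and the helpers toBytes, fromBytes and keyRecords are hypothetical names, and three non-empty lines per record matches the sample as it appears in the question, not necessarily the real files. In a real mapper the lines would come from context.nextKeyValue(), and the 16-byte array would be wrapped in a byte-oriented Writable.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper class; none of these names come from Hadoop.
class UuidKeySketch {

    // Encode a UUID as 16 bytes: most significant long first, big-endian.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Decode the 16 bytes produced by toBytes back into a UUID.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    // Group every linesPerRecord non-empty lines into one record and
    // give each record a fresh random UUID as its key.
    // A trailing incomplete record is dropped (an assumption of this sketch).
    static Map<UUID, List<String>> keyRecords(Iterable<String> lines, int linesPerRecord) {
        Map<UUID, List<String>> out = new LinkedHashMap<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between records
            }
            current.add(trimmed);
            if (current.size() == linesPerRecord) {
                out.put(UUID.randomUUID(), current);
                current = new ArrayList<>();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "xxx.example.com", "xxx", "abcdef", "",
                "yyy.example.com", "yyy", "abcdef");
        for (Map.Entry<UUID, List<String>> e : keyRecords(lines, 3).entrySet()) {
            byte[] keyBytes = toBytes(e.getKey());
            System.out.println(e.getKey() + " (" + keyBytes.length + " bytes) -> " + e.getValue());
        }
    }
}
```

Because every mapper generates its keys independently, no coordination between tasks is needed. The price is that the key is a 128-bit random value rather than a small sequential number.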

OK, your reducer (if any) and output format are another story. You will definitely need an output format that can store your objects if you don't convert them to something like Text during mapping.
