Yuhao

Reputation: 1610

Hadoop: How can I give every value a global unique ID number as key in Mapper?

Here is what I want to do. I have some text files like this:

<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>

<page>
<url>yyy.example.com</url>
<title>yyy</title>
<content>abcdef</content>
</page>

...

I want to read the file splits in the mapper and convert them to key-value pairs, where each value is the content of one <page> tag.

My problem is with the key. I could use the URLs as keys because they are globally unique. However, due to the context of my job, I want to generate a globally unique number as the key for each key-value pair. I know this somewhat goes against the horizontal scalability of Hadoop. But is there any solution to this?

Upvotes: 4

Views: 3028

Answers (2)

Tejas Patil

Reputation: 6169

Not sure if this answers your question directly, but I am taking advantage of the input file format.

You might use the NLineInputFormat and set N = 6 as each record encompasses 6 lines:

<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
.

With each record, the mapper would get the offset position in the file. This offset would be unique for each record.

PS: This would work only if the schema is fixed, i.e. every record spans the same number of lines. I doubt it would work properly for multiple input text files, since an offset is unique only within a single file.
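One possible way around the multi-file doubt is to pair the byte offset with the name of the file it came from. The sketch below is plain Java with no Hadoop dependency, and `RecordKeys`, `compositeKey` and `pageOffsets` are hypothetical names made up for illustration. It assumes that, in a real mapper, the file name would come from the input split (for example `FileSplit.getPath()`) and the offset from the `LongWritable` key that line-based input formats hand to the mapper.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class RecordKeys {
    // Offsets repeat from file to file, but a (file name, offset) pair does not,
    // as long as file names themselves are unique within the job input.
    static String compositeKey(String fileName, long byteOffset) {
        return fileName + ":" + byteOffset;
    }

    // Returns the byte offset at which each "<page>" line starts, mimicking
    // the offsets a line-based record reader would report for those lines.
    static List<Long> pageOffsets(String fileContent) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n", -1)) {
            if (line.trim().equals("<page>")) {
                offsets.add(offset);
            }
            // +1 accounts for the '\n' that split() removed.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String file = "<page>\n<url>a</url>\n</page>\n\n<page>\n<url>b</url>\n</page>\n";
        for (long off : pageOffsets(file)) {
            System.out.println(compositeKey("part-0.txt", off));
        }
    }
}
```

Note that this yields a unique string rather than a unique number. If a numeric key is a hard requirement, the file name would have to be mapped to a number first, which this sketch does not attempt.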

Upvotes: 0

Roman Nikitchenko

Reputation: 13046

If you're going to process such files with MapReduce, I'd take the following strategy:

  1. Use the general text input format, reading line by line. As a result, each file goes to a different mapper.
  2. In the mapper, build a loop that reads the next lines itself through context.nextKeyValue(), instead of relying on being called once for each line.
  3. Feed the lines to some syntax analyzer (maybe it is enough to just read 6 non-empty lines, maybe you will use something like libxml); in the end you will get a number of objects.
  4. If you intend to pass the objects you read to a reducer, you need to wrap them in something that implements the Writable interface.
  5. To form keys I'd use the UUID implementation in java.util.UUID. Something like:

    UUID key = UUID.randomUUID();

    It's enough if you're not generating billions of records per second and your job does not take 100 years. :-)

  6. Just note: the UUID should probably be encoded in an ImmutableBytesWritable (an HBase class), which is useful for such things.

  7. That's all: context.write(key, object), with the UUID as the key.
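The core of the steps above can be sketched in plain Java, without the Hadoop plumbing. This is only an illustration under stated assumptions: `PageKeying`, `splitPages`, `toBytes` and `keyPages` are hypothetical names, the "syntax analyzer" is reduced to matching the `<page>` and `</page>` lines, and the 16-byte encoding is just one way to get a UUID into a byte-oriented Writable. In a real mapper the lines would come from the context.nextKeyValue() loop and each pair would be emitted with context.write.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class PageKeying {
    // Groups lines into records: everything from "<page>" through "</page>"
    // becomes one value. Blank lines and lines outside a page are skipped.
    static List<String> splitPages(Iterable<String> lines) {
        List<String> pages = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.equals("<page>")) {
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
                if (line.equals("</page>")) {
                    pages.add(current.toString());
                    current = null;
                }
            }
        }
        return pages;
    }

    // Encodes a UUID as 16 bytes (most significant long first), a form that
    // could be wrapped in a byte-array Writable.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Assigns a random UUID key to every page, preserving input order.
    static Map<UUID, String> keyPages(Iterable<String> lines) {
        Map<UUID, String> keyed = new LinkedHashMap<>();
        for (String page : splitPages(lines)) {
            keyed.put(UUID.randomUUID(), page);
        }
        return keyed;
    }
}
```

Because UUID.randomUUID() needs no coordination between mappers, each task can generate keys independently, which is what keeps this approach compatible with Hadoop's horizontal scaling. The trade-off is that the keys are 128-bit identifiers rather than small sequential numbers.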

OK, your reducer (if any) and output format are another story. You will definitely need an output format to store your objects if you don't convert them to something like Text during mapping.

Upvotes: 2
