Reputation: 1610
Here is what I want to do. I have some text files like this:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
<page>
<url>yyy.example.com</url>
<title>yyy</title>
<content>abcdef</content>
</page>
...
And I want to read the file split in the mapper and convert it into key-value pairs, where each value is the content of one <page> tag.
My problem is about the key. I could use the URLs as keys because they are globally unique. However, due to the context of my job, I want to generate a globally unique number as the key for each key-value pair instead. I know this somewhat goes against the horizontal scalability of Hadoop, but is there any solution to this?
Upvotes: 4
Views: 3028
Reputation: 6169
Not sure if this answers your question directly, but I am taking advantage of the input file format.
You could use NLineInputFormat and set N = 6, since each record spans 6 lines:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
With each record, the mapper would get the offset position in the file. This offset would be unique for each record.
PS: This would work only if the schema is fixed. I am doubtful whether it would work properly with multiple input text files.
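For reference, here is a minimal driver sketch of that wiring, assuming the new org.apache.hadoop.mapreduce API; the class name PageJob and the argument handling are placeholders, not taken from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-offset-keys");
        job.setJarByClass(PageJob.class);

        // Hand each map task 6 input lines, matching the fixed <page> schema;
        // the mapper still receives (byte offset, line) pairs, and the offset
        // of the first line of each record is unique within the file.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 6);

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}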
Upvotes: 0
Reputation: 13046
If you're going to process such files with MapReduce, I'd take the following strategy:
1. Use a custom input format whose record reader emits one whole <page>...</page> block per record, so context.nextKeyValue() is called once per page instead of being called for each line.
2. Store the page content in your own class implementing the Writable interface.
To form keys I'd use the UUID implementation, java.util.UUID. Something like:
UUID key = UUID.randomUUID();
It's enough if you're not generating billions of records per second and your job does not take 100 years. :-)
Just note - the UUID should probably be encoded in an ImmutableBytesWritable, a class useful for such things, before you pass it to context.write(object, key).
OK, your reducer (if any) and output format are another story. You will definitely need an output format to store your objects if you don't convert them to something like Text during the mapping.
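As an illustration, here is a minimal mapper sketch of that approach, assuming an input format that already delivers one whole page per record as Text; the class name UuidKeyMapper is a placeholder, and ImmutableBytesWritable is the HBase class org.apache.hadoop.hbase.io.ImmutableBytesWritable:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.UUID;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UuidKeyMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text page, Context context)
            throws IOException, InterruptedException {
        // Statistically unique across all mappers, no coordination needed.
        UUID key = UUID.randomUUID();

        // Pack the UUID's 128 bits into an ImmutableBytesWritable.
        byte[] bytes = ByteBuffer.allocate(16)
                .putLong(key.getMostSignificantBits())
                .putLong(key.getLeastSignificantBits())
                .array();

        context.write(new ImmutableBytesWritable(bytes), page);
    }
}

Whether the UUID ends up as the key or the value in context.write() depends on what your reducer and output format expect; here it is emitted as the map output key, matching the question's goal of a generated unique identifier per page.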
Upvotes: 2