Hadoop: How to create an auto-increment id

Question

I need the SQL equivalent of an AUTO_INCREMENT id in Hadoop.

When my reduce task identifies a new item, that item needs a unique ID assigned.

How can I share an atomic counter across the cluster? The reporter counters seem to be increment-only; I don't see a getAndIncrement feature.
How can I set that counter before the map/reduce phase of the job starts?

Ray Toal · Accepted Answer

To generate ids in a distributed setting, you can either just generate UUIDs or use Apache ZooKeeper, which provides distributed coordination for Hadoop clusters. Disclaimer: I have never used ZooKeeper, so I don't know whether you can really (even theoretically) get a globally contiguous set of ids, which is what the question seems to be asking for.

UUIDs are not free, though: each one takes some time to generate.
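For the UUID route, here is a minimal sketch in plain Java. It only shows the id generation itself using the standard java.util.UUID class; the class name UuidIds and the helper newId() are made up for illustration, and wiring this into your reducer is left out. Note that these ids are unique but neither sequential nor contiguous, so this is not a true AUTO_INCREMENT replacement.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper class, not part of Hadoop.
public class UuidIds {

    // Returns a new random (version 4) UUID as a 36-character string.
    // Each task can call this on its own, with no coordination across the cluster.
    public static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Generate a few ids and count how many distinct values we got.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < 5; i++) {
            String id = newId();
            seen.add(id);
            System.out.println(id);
        }
        System.out.println("distinct ids: " + seen.size());
    }
}
```

In a reduce task you would presumably call something like newId() at the point where a new item is detected and emit the result with the record.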

For good general information on distributed ID generation, see this Stack Overflow question.
