Reputation: 32081
I need the equivalent of a SQL AUTO_INCREMENT id in Hadoop.
When my reduce task identifies a new item, that item needs a unique ID assigned.
How can I share an atomic counter across the cluster? The reporter counters seem to be increment-only; I don't see any getAndIncrement feature.
How can I set that counter before the map/reduce phase of the job starts?
Upvotes: 3
Views: 2718
Reputation: 88428
To generate IDs in a distributed setting, you can either generate UUIDs or use Apache ZooKeeper, which provides distributed coordination for Hadoop clusters. Disclaimer: I have never used ZooKeeper, so I don't know whether you can really (even theoretically) get a globally contiguous set of IDs, which is what the question seems to be asking for.
Generating UUIDs does have a cost, though; they take some time to generate.
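As a minimal sketch of the UUID option, the snippet below uses the JDK's java.util.UUID to produce random (version 4) identifiers. The class name UuidIds and the helper newId are hypothetical names chosen for illustration, not part of Hadoop. Note that these IDs are unique with overwhelming probability but are neither sequential nor contiguous, so this does not reproduce AUTO_INCREMENT ordering.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical illustration class; not part of any Hadoop API.
class UuidIds {
    // Returns a random (version 4) UUID as a 36-character string.
    // No coordination between tasks or nodes is required.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Generate a few IDs and count how many distinct values we saw.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < 5; i++) {
            String id = newId();
            seen.add(id);
            System.out.println(id);
        }
        System.out.println("distinct ids: " + seen.size());
    }
}
```

In a reduce task, a call like newId() could be made each time a new item is identified. Because every task generates its IDs independently, nothing has to be initialized before the job starts.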
For good general information on distributed ID generation, see this Stack Overflow question.
Upvotes: 2