Biju CD

Reputation: 5109

Unique Key generation in Hive/Hadoop

While selecting a set of records from a big-data Hive table, a unique key needs to be created for each record. In a sequential mode of operation it is easy to generate a unique id by calling something like max(id). Since Hive runs its tasks in parallel, how can we generate a unique key as part of a select query without compromising Hadoop's performance? Is this really a map-reduce problem, or do we need to fall back to a sequential approach to solve it?

Upvotes: 2

Views: 5860

Answers (3)

leftjoin

Reputation: 38335

Use a UUID instead of numbers. It works in a truly distributed way.

select reflect("java.util.UUID", "randomUUID")
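Under the hood, reflect() invokes the JDK's static method via reflection. For reference, a trivial Java equivalent of what each row receives (the class name here is mine, just for illustration):

import java.util.UUID;

public class UuidExample {
    public static void main(String[] args) {
        // Same call the Hive reflect() UDF makes for every row
        String key = UUID.randomUUID().toString();
        System.out.println(key); // e.g. f47ac10b-58cc-4372-a567-0e02b2c3d479
    }
}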

Upvotes: 1

KSN

Reputation: 143

SELECT T.*, ROW_NUMBER() OVER (ORDER BY T.C1) AS SEQ_NBR
FROM T

Here C1 is any numeric column in T. This will generate a unique number for each record selected from T, starting from 1. If this is a one-time activity, this solution is fine.

In case you need to repeat this process every day, inserting the data into table T2 and keeping the generated ids unique across runs, you can try the method below.

SELECT T.*, ROW_NUMBER() OVER (ORDER BY T.C1) + M.SEQ_T2 AS SEQ_NBR
FROM T
CROSS JOIN (SELECT COALESCE(MAX(SEQ_NBR), 0) AS SEQ_T2 FROM T2) M

The COALESCE covers the first run, when T2 is still empty, and Hive requires an alias on the subquery.

Hope it helps!

Upvotes: 0

yurgis

Reputation: 4077

If for some reason you do not want to deal with UUIDs, then this solution (based on numeric values) does not require your parallel units to "talk" to each other or synchronize at all. Thus it is very efficient, but it does not guarantee that your integer keys will be contiguous.

If you have, say, N parallel units of execution, you know N, and each unit is assigned an ID from 0 to N - 1, then each unit can generate integers that are unique across all units:

Unit #0:   0, N, 2N, 3N, ...
Unit #1:   1, N+1, 2N+1, 3N+1, ...
...
Unit #N-1: N-1, N+(N-1), 2N+(N-1), 3N+(N-1), ...
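A minimal sketch of this scheme in plain Java (the class and method names are mine, not part of any Hadoop API):

public class InterleavedIdGenerator {
    private final int n;  // N, the total number of parallel units
    private long next;    // next id this unit will hand out

    public InterleavedIdGenerator(int unitId, int n) {
        this.n = n;
        this.next = unitId;  // unit k starts at k
    }

    public long nextId() {
        long id = next;
        next += n;           // stepping by N keeps the units disjoint
        return id;
    }

    public static void main(String[] args) {
        InterleavedIdGenerator unit0 = new InterleavedIdGenerator(0, 3);
        InterleavedIdGenerator unit1 = new InterleavedIdGenerator(1, 3);
        System.out.println(unit0.nextId() + ", " + unit0.nextId()); // 0, 3
        System.out.println(unit1.nextId() + ", " + unit1.nextId()); // 1, 4
    }
}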

Depending on where you need to generate the keys (mapper or reducer), you can get your N from the Hadoop configuration:

Mapper: mapred.map.tasks
Reducer: mapred.reduce.tasks

... and the ID of your unit. In Java, it is:

 context.getTaskAttemptID().getTaskID().getId()

I am not sure about Hive, but it should be possible there as well.
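Putting the pieces together, here is a rough sketch of a mapper that tags each record with such a key. It assumes the old mapred.map.tasks property is set; the class and field names are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UniqueKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private long nextId;  // next unique key this task will hand out
    private long stride;  // N, the total number of parallel map tasks

    @Override
    protected void setup(Context context) {
        stride = context.getConfiguration().getInt("mapred.map.tasks", 1);
        // this unit's ID, in the range [0, N)
        nextId = context.getTaskAttemptID().getTaskID().getId();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new LongWritable(nextId), value); // emit the record with its key
        nextId += stride; // jump by N so tasks never collide
    }
}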

Upvotes: 3
