Brendan Abel
Brendan Abel

Reputation: 37549

Safest way to generate a unique hash in Python

I need to produce unique identifiers that can be used in filenames and can be reproduced given the same input values. I need to produce millions of these identifiers as the source input has millions of combinations.

For simplicity's sake, I will use a small set in the example, but the actual sets can be rather large (hundreds, maybe thousands, of items); larger than could be manually encoded into a filename.

I noticed that the 5th method of generating UUID's allows you to provide a string input.

> input_set = {'apple', 'banana', 'orange'}
> uuid.uuid5(uuid.NAMESPACE_URL, pickle.dumps(input_set)).hex
'f39926529ad45997984643816c1bc403'

The documentation says it uses SHA1 under the hood. Is the risk of a collision too high? Is there a better way of reliably hashing unique identifiers?

Upvotes: 9

Views: 25361

Answers (3)

Thomas
Thomas

Reputation: 10135

Instead of using pysha3 (see DoesData's answer), you could also use the built-in module hashlib:

import hashlib

h = hashlib.sha3_512() # Python 3.6+
h.update(b"Hello World")
h.hexdigest()

Output:

'3d58a719c6866b0214f96b0a67b37e51a91e233ce0be126a08f35fdf4c043c6126f40139bfbc338d44eb2a03de9f7bb8eff0ac260b3629811e389a5fbee8a894'

Upvotes: 8

Asclepius
Asclepius

Reputation: 63524

If the smaller base64.urlsafe_b64encode output would be preferable:

> import base64, hashlib

> base64.urlsafe_b64encode(hashlib.sha3_512('asdf'.encode()).digest())
b'jYjPWyD1Os164UebWzbcICF1OwSZAsdyR7snsTGzAL08qL7vKHVtzie4mQhnxFd6JTXn47dRQTmcoalMyEsOuQ=='

The above output is of length 88 whereas the corresponding hex would be of length 128.

Upvotes: 0

DoesData
DoesData

Reputation: 7057

The odds that you'd get an SHA1 collision from strings is astoundingly low. Currently there are less than 63 known collisions for SHA1.

First ever SHA1 collision found

First ever' SHA-1 hash collision calculated. All it took were five clever brains... and 6,610 years of processor time

SHA1 is no longer considered secure in the cryptography world, but certainly exceeds your expectations here.

Cryptographic hashing functions are designed to be one way functions.This means the functions inverse is "hard" to calculate. (i.e. knowing the output in no way helps you determine the input) As Blender pointed out in the comments this has nothing to do with the chance of collisions.

Take a look at the Birthday Paradox for some basic information on how the probability of a collision is calculated.

This question addresses the likely hood of a SHA1 collision. This article states

A cryptographic hash function has provable security against collision attacks if finding collisions is provably polynomial-time reducible from problem P which is supposed to be unsolvable in polynomial time. The function is then called provably secure, or just provable.

Here is a list of "secure" hash algorithms.

UPDATE You stated in the comments your input is much larger than the 160 bit limit for SHA1. I recommend you use SHA3 in this case as there is no limit on the size of your input. Check out the Python documentation for more information.

Here is a basic example:

import sha3
k = sha3.keccak_512()
k.update(b"data")
k.hexdigest()
'1065aceeded3a5e4412e2187e919bffeadf815f5bd73d37fe00d384fe29f55f08462fdabe1007b993ce5b8119630e7db93101d9425d6e352e22ffe3dcb56b825'

Upvotes: 10

Related Questions