ensnare
ensnare

Reputation: 42043

Short Python alphanumeric hash with minimal collisions

I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32-characters).

What are some alternative hash functions that perhaps use every letter in the alphabet as well as integers that are perhaps shorter in string length and have low collision rates?

Thanks!

Upvotes: 49

Views: 51702

Answers (5)

Ryabchenko Alexander
Ryabchenko Alexander

Reputation: 12390

I used xor of md5 bytes to get shorter hash

>>> d = hashlib.md5(b"hello worlds").digest()  # 16 bytes

# xor of bytes to get 3 hash bytes
>>> h = bytes([
     d[0] ^ d[1] ^ d[2] ^ d[3] ^ d[14] ^ d[15], 
     d[4] ^ d[5] ^ d[6] ^ d[7] ^ d[13], 
     d[8] ^ d[9] ^ d[10] ^ d[11] ^ d[12]],
     )  

>>> base64.urlsafe_b64encode(h)
b'8xC5'

# 4 digit str
>>> base64.urlsafe_b64encode(h).decode('utf-8')  
'8xC5'

Upvotes: 0

John La Rooy
John La Rooy

Reputation: 304215

The smallest builtin hash I am aware of is md5

>>> import hashlib, base64
>>> d=hashlib.md5(b"hello worlds").digest(); d=base64.b64encode(d); 
>>> print(d)

b'S27ylES0wiLdFAGdUpFgCQ=='

Low collision and short are somewhat mutually exclusive due to the birthday paradox

To make it urlsafe you need to use the function from the base64 module

>>> import base64
>>> base64.urlsafe_b64encode(hashlib.md5("hello world").digest())
'XrY7u-Ae7tCTyyK7j1rNww=='

However there should be no problem storing the 16 byte md5 digest in the database in binary form.

>>> md5bytes=hashlib.md5("hello world").digest()
>>> len(md5bytes)
16
>>> urllib.quote_plus(md5bytes)
'%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3'

Python 2

>>> base64.urlsafe_b64encode(md5bytes)
'XrY7u-Ae7tCTyyK7j1rNww=='

Python 3

>>> base64.urlsafe_b64encode(md5bytes).decode('ascii')
'XrY7u-Ae7tCTyyK7j1rNww=='

You can choose either the quote_plus or the urlsafe_b64encode for your url, then decode with the corresponding function unquote_plus or urlsafe_b64decode before you look them up in the database.

Upvotes: 45

Matthew Flaschen
Matthew Flaschen

Reputation: 284836

Why don't you just truncate SHA1 or MD5? You'll have more collisions then if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.

import base64
import hashlib
hasher = hashlib.sha1("The quick brown fox")
base64.urlsafe_b64encode(hasher.digest()[:10])

You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.

EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which uses - and _ rather than + and /.

Upvotes: 45

Paulo Scardine
Paulo Scardine

Reputation: 77281

You can use something like base 32 notation. It is more compact than decimal notation, case insensitive and collision-free. Just encode a plain old sequence number to generate a short hash-like code.

If the key is not for human consumption, you can use base 64 notation, which is case sensitive but a little more compact.

See http://code.google.com/p/py-cupom/ for an example.

Upvotes: 0

Daniel Stutzbach
Daniel Stutzbach

Reputation: 76697

Below is a solution that uses alphanumeric characters plus a few punctuation characters. It returns very short strings (around 8 characters).

import binascii, struct

def myhash(s):
    return binascii.b2a_base64(struct.pack('i', hash(s)))

Upvotes: 2

Related Questions