Reputation: 31141
I'm working on a program that hashes image URLs to a 10-character hexadecimal string, e.g. 64fd54ad29.
It's written in Python, and the hash is calculated like this:
    def hash_short(self, url):
        return hashlib.sha1(url).hexdigest()[:10]
I'm concerned about collisions with such a short hash. I expected a collision after around one million hashes, but when I ran a brute force it took over ten million hashes to find one.
A hexadecimal digit has 16 possible values, or 2^4. With ten characters I have 16^10 = 2^40 possibilities, i.e. 40 bits of entropy.
For a collision probability of 1 we'd need to look at 2^40 + 1 URLs (by the pigeonhole principle), but we should expect a collision much sooner.
A birthday attack (i.e. a brute force) on an n-bit hash is expected to find a collision after about 2^(n/2) attempts. Therefore I'd expect to see a collision after around 2^20 URLs, which is 1,048,576.
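Sanity-checking that arithmetic (a quick Python 3 sketch, numbers only, no hashing):

    import math

    # 10 hex characters = 40 bits, so 2^40 possible short hashes.
    space = 16 ** 10
    assert space == 2 ** 40

    # Birthday rule of thumb: collisions become likely around sqrt(space).
    print(2 ** 20)                          # 1048576

    # More precisely, the expected number of hashes drawn before the
    # first collision is sqrt(pi/2 * space), about 1.31 million.
    print(math.sqrt(math.pi / 2 * space))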
I wrote a simple Python script that iterated over a long list of URLs and compared each hash to those I'd seen before. It took me 10,800,000 URLs to find my first collision: "http://c69025.r25.cf3.rackcdn.com/_image1/_Model/34897.jpg" and "http://media.editd.com/assets/matrix/full/72f9a997b67c65c66f4adc769ee0a127d1db25eb.jpg" both hash to "ba2be44bd1". Here's the script:
    import hashlib
    import json

    def calculate_short_hash(url):
        return hashlib.sha1(url).hexdigest()[:10]

    def url_from_json(json_string):
        return json.loads(json_string)['image_url']

    if __name__ == '__main__':
        # Remember every short hash seen so far; stop at the first repeat.
        short_hashes = set()
        for i, line in enumerate(open('urls.all')):
            short_hash = calculate_short_hash(url_from_json(line))
            if short_hash in short_hashes:
                print "Already seen: %s" % short_hash
                break
            else:
                short_hashes.add(short_hash)
            if i % 100000 == 0:
                print "Processed %d lines" % (i,)
Either my maths is incorrect or I was very unlucky. Which is it? How unlucky was I?
Upvotes: 4
Views: 1962
Reputation: 279195
I think your collision detection code is wrong:
    import hashlib
    import random
    import string

    def hash_short(url):
        return hashlib.sha1(url).hexdigest()[:10]

    # Map each short hash to the string that produced it.
    hashes = dict()
    while True:
        if len(hashes) % 10000 == 0:
            print len(hashes)
        # A random 30-character lowercase string stands in for a URL.
        newurl = ''.join(random.choice(string.lowercase) for _ in xrange(30))
        newhash = hash_short(newurl)
        # Only two *different* strings with the same hash count as a collision.
        if newhash in hashes and newurl != hashes[newhash]:
            print 'found a collision!'
            print newhash
            print newurl
            print hashes[newhash]
            print len(hashes)
            break
        hashes[newhash] = newurl
Output (run once):
    ...
    770000
    780000
    found a collision!
    216be03ec7
    txnbkwrfkpkmiexloxrifdsnjumkex
    xlnmlhobtsswjvmqnjupaybkspptpo
    780758
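Note, incidentally, that the dict maps each short hash to the string that produced it, and the newurl != hashes[newhash] check ensures that drawing the same random string twice isn't miscounted as a collision. A bare set of hashes, like the one in your script, can't make that distinction: a duplicate URL in urls.all would be reported as a collision.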
Obviously my so-called URLs aren't really URLs, but that shouldn't make any difference with a good hash function (and SHA1 is good for this purpose). If you've found a dataset that genuinely has an unusually low rate of collisions on the first 5 bytes of SHA1, then well done! Try it again with the last 5 bytes :-)
How unlucky were you? By the time you have 10 million hashes, your 2**40 space is full to about one part in 100,000. So the probability of no collision is roughly (finger in the air) (99999.0/100000)**10000000, which is about 3.7e-44. So if my maths is correct [Edit: which it isn't, see comments] you were astronomically, convicted-beyond-reasonable-doubt unlucky.
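Those figures are straightforward to reproduce (a Python 3 sketch; the second number is the standard birthday approximation, which matches the exact result quoted below):

    import math

    n, N = 10**7, 2**40   # hashes generated, size of the 40-bit space

    # Finger-in-the-air version: every draw lands in a table that is
    # already one part in 100k full.
    print((1 - 1.0 / 100000) ** n)           # about 3.7e-44

    # Standard birthday approximation: the table fills up gradually,
    # so the exponent is n*(n-1)/(2*N), about 45, not about 100.
    print(math.exp(-n * (n - 1) / (2 * N)))  # about 1.7e-20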
As a conservative upper bound on the probability of getting no collision by chance: you made 9 million trials after there were already 1 million hashes in play, so the probability of no collision is strictly less than (1 - 10**6/2**40)**9000000, which is about 0.0003. You could produce smaller bounds by splitting things up further (the last 1 million trials, for instance, happened with at least 9 million hashes occupied), or compute the probability exactly, which CodesInChaos did: about 1e-20.
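That bound is a one-liner too (again Python 3):

    # The last 9 million draws each had at least 1 million hashes in play.
    n_trials, occupied, N = 9 * 10**6, 10**6, 2**40
    print((1 - occupied / N) ** n_trials)    # about 0.0003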
So, Bayesian statistics being what it is, I reckon the probability of a bug in your code is higher than all of those numbers, even the really large conservative bound :-)
Upvotes: 1