Wilfred Hughes

Reputation: 31141

How do I calculate how many hashes I need in order to find a collision?

I'm working on a program that hashes image URLs to a 10-character hexadecimal string, e.g. 64fd54ad29.

It's written in Python, and the hash is calculated like this:

def hash_short(self, url):
    return hashlib.sha1(url).hexdigest()[:10]

I'm concerned about collisions with such a short hash. I expected a collision after around one million hashes, but when I ran a brute force it took roughly ten million hashes to find one.

Calculations

A hexadecimal digit has 16 possible values, or 2^4. With ten characters I have 2^40 possibilities, or 40 bits of entropy.

To have a collision probability of 1, we'd need to look at 2^40 + 1 URLs (by the pigeonhole principle), but we would expect a collision much sooner.

A birthday attack (i.e. a brute force) on an n-bit hash will find a collision after about 2^(n/2) attempts. Therefore we'd expect a collision after around 2^20 URLs, which is 1,048,576.
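As a rough cross-check of that figure, the standard birthday approximation gives P(collision) ≈ 1 - exp(-k(k-1)/(2N)) after k hashes drawn from a space of N values. A quick illustrative snippet (not part of my actual script; the values of k are arbitrary):

import math

N = 2 ** 40  # size of the 10-hex-character hash space

for k in (2 ** 20, 2 * 10 ** 6, 10 ** 7):
    p_collision = 1 - math.exp(-k * (k - 1) / (2.0 * N))
    # should print approximately 0.393, 0.838 and 1.000
    print "k = %8d  P(collision) ~= %.3f" % (k, p_collision)

By this approximation, 2^20 hashes give only about a 39% chance of a collision; the 50% point is nearer sqrt(2 N ln 2) ≈ 1.2 million, and by 10 million a collision is essentially certain.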

Brute forcing

I wrote a simple Python script that iterated over a long list of URLs and compared each hash to those I'd seen before. It took me 10,800,000 URLs to find my first collision: "http://c69025.r25.cf3.rackcdn.com/_image1/_Model/34897.jpg" and "http://media.editd.com/assets/matrix/full/72f9a997b67c65c66f4adc769ee0a127d1db25eb.jpg" both hash to "ba2be44bd1".

import hashlib
import json

def calculate_short_hash(url):
    return hashlib.sha1(url).hexdigest()[:10]


def url_from_json(json_string):
    return json.loads(json_string)['image_url']

if __name__ == '__main__':
    short_hashes = set()

    for i, line in enumerate(open('urls.all')):
        short_hash = calculate_short_hash(url_from_json(line))

        if short_hash in short_hashes:
            print "Already seen: %s" % short_hash
            break
        else:
            short_hashes.add(short_hash)

        if i % 100000 == 0:
            print "Processed %d lines" % (i,)

Summary

Either my maths is incorrect or I was very unlucky. Which is it? How unlucky was I?

Upvotes: 4

Views: 1962

Answers (1)

Steve Jessop

Reputation: 279195

I think your collision detection code is wrong:

import hashlib
import random
import string

def hash_short(url):
    return hashlib.sha1(url).hexdigest()[:10]

hashes = dict()
while True:
    if len(hashes) % 10000 == 0:
        print len(hashes)
    newurl = ''.join(random.choice(string.lowercase) for _ in xrange(30))
    newhash = hash_short(newurl)
    if newhash in hashes and newurl != hashes[newhash]:
        print 'found a collision!'
        print newhash
        print newurl
        print hashes[newhash]
        print len(hashes)
        break
    hashes[newhash] = newurl

Output (run once):

...
770000
780000
found a collision!
216be03ec7
txnbkwrfkpkmiexloxrifdsnjumkex
xlnmlhobtsswjvmqnjupaybkspptpo
780758

Obviously my so-called URLs aren't real URLs, but that should make no difference with a good hash function (and SHA1 is good for this purpose). If you've found a dataset that genuinely has an unusually low rate of collisions on the first 5 bytes of SHA1, then well done! Try it again with the last 5 bytes :-)
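(For what it's worth, taking the last 5 bytes is just a different slice of the same hex digest; a hypothetical variant for illustration:)

import hashlib

def hash_short_tail(url):
    # hypothetical variant: last 10 hex characters (5 bytes) of the SHA1 digest
    return hashlib.sha1(url).hexdigest()[-10:]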

How unlucky were you? By the time you have 10 million hashes, your 2**40 space is full to about one part in 100k. So, finger in the air, the probability of no collision is roughly (99999.0/100000) ** 10 million, which is 3.7e-44. So if my maths is correct [Edit: which it isn't, see comments] you were astronomically, convicted-beyond-reasonable-doubt unlucky.

As a conservative upper bound on the probability of getting no collision by chance: you did 9 million trials after there were already 1 million hashes in play, so the probability of no collisions is less than about (999999.0 / 1000000) ** 9000000, which is only 0.0001. You could produce smaller such bounds by splitting it up further: for example, the last 1 million trials each had at least 9 million hashes already occupied. Or you could compute the probability exactly (which CodesInChaos did: about 1e-20).
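For reference, the exact no-collision probability after k draws is the product of (1 - i/N) for i from 0 to k-1. A small sketch that evaluates it in log space to avoid underflow (my rough cross-check, not CodesInChaos's actual calculation):

import math

N = 2.0 ** 40          # size of the 40-bit hash space
k = 10 * 1000 * 1000   # roughly the number of URLs hashed

# sum of log(1 - i/N) for i = 0 .. k-1; takes a few seconds to run
log_p_no_collision = sum(math.log1p(-i / N) for i in xrange(k))
print "P(no collision) ~= %.1e" % math.exp(log_p_no_collision)

This should print something on the order of 1e-20, in line with CodesInChaos's figure.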

So, Bayesian statistics being what it is, I reckon the probability of a bug in your code is higher than all of those numbers, even the really large conservative bound :-)

Upvotes: 1
