Tom Auger
Tom Auger

Reputation: 20101

How can I calculate the impact on collision probability when truncating a hash?

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.

I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given I have a limited Math background), let alone use to determine the impact on collision probability that truncating the hash would have.

The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.

Upvotes: 2

Views: 1162

Answers (2)

Tom Auger
Tom Auger

Reputation: 20101

@mypetition's answer is great.

I found a few other equations that are more-or-less accurate and/or simplified here, along with a great explanation and a handy comparison of real-world probabilities:

...where k is the number of ID's you'll be generating (the "messages") and N is the largest number that can be produced by the hash digest or the largest number that your truncated hexadecimal number could represent (technically + 1, to account for 0).


A bit more about "N"

If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars), and you truncate it down to, say, 12 hex chars (e.g.: "38BF05A71DDF"), the largest number you could produce in hexadecimal is "0xFFFFFFFFFFFF" (281474976710655 - which is 16^12-1 (or 256^6 if you prefer to think in terms of bytes). But since "0" itself counts as one of the numbers you could theoretically produce, you add back that 1, which leaves you simply with 16^12.

So you can think of N as 16 ^ (numberOfHexDigits).

Upvotes: 2

mypetlion
mypetlion

Reputation: 2524

You can calculate the chance of collisions with this formula:

chance of collision = 1 - e^(-n^2 / (2 * d))

Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).

Upvotes: 1

Related Questions