balpha
balpha

Reputation: 50948

Handling close-to-impossible collisions on should-be-unique values

There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.

A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.

Other implementations however (seem to) depend on the uniqueness. My example (that's what lead me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states

The odds of the changeset hash colliding by accident in your first billion commits is basically zero. But we will notice if it happens. And you'll get to be famous as the guy who broke SHA1 by accident.

But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.

Rather, I'd like to know (maybe by means of examples from your own work):

For me, especially the following are interesting, although they are somewhat touchy-feely:

Upvotes: 7

Views: 827

Answers (2)

Aaron Maenpaa
Aaron Maenpaa

Reputation: 122960

Given a good 128 bit hash, the probably of colliding with a specific hash value given a random input is:

1 / 2 ** 128 which is approximately equal to 3 * 10 ** -39.

The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.

p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)

where !denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:

Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png

Between 10**17 and 10**18 hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with 10**19 hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.

Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or someone attacking your system and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision are vanishingly small (that is to say the probability of bugs or malice is much higher than an accidental collision).

Upvotes: 4

Steve Jessop
Steve Jessop

Reputation: 279335

  • If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.

However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.

Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.

If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.

  • Or should even these cases get handled gracefully?

The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).

Upvotes: 7

Related Questions