user121196
user121196

Reputation: 31070

can md5 uniquely identify one hundred million strings

given hundreds of millions of unique strings with average length of a few hundreds, can md5 uniquely represent each of them? can collision occur? security is not a concern but uniqueness is.

Upvotes: 1

Views: 283

Answers (3)

jduncanator
jduncanator

Reputation: 2194

If MD5 distributes its results evenly along the 2^128 space (which it doesn't, but it's pretty close), you could calculate the chance of two values in a collection of size n having a collision. This is often referred to as 'the birthday problem'.

Some of this math may seem confusing so I'll explain it best as possible.

Let M be the size of the range of MD5 (2^128 since MD5 is a 128 bit hashing function)

Let n be the number of random values in this range (you said 100,000,000)

We can calculate p, the probability for at least one collision, with:

Equation

Using the values you supplied:

Equation

Thanks to Dukeling for providing the answer to the above equation, 1.46E-23, which comes out at 0.0000000000000000000000146. You can read more about the formulae here.

Upvotes: 4

David Schwartz
David Schwartz

Reputation: 182885

If you are concerned about an attacker maliciously constructing colliding strings, you cannot use MD5. If that's not an issue, MD5 is most likely good enough for your application with typical failure rates in realistic use cases on the order of one accidental collision per million years.

However, I would suggest picking something even more reliable so you don't have to worry about it. If nothing else, you will always have to defend your decision to use MD5 given that it's "known broken".

For example, you could use MD160 to get 160-bit hashes, SHA-1 to get 168-bit hashes, or SHA-256 to get 256-bit hashes. All of these algorithms have no known collisions despite effort to try to find them. Accidental collisions are billions of times less likely than failure due to asteroid impact.

The best choice depends on your priorities. What are the consequences of a collision? Do you have to resist malicious attacks? How critical is performance? How critical is hash size? Give us some more details and we can give you better advice.

Upvotes: 1

Bernhard Barker
Bernhard Barker

Reputation: 55649

For any type of hash function e.g. MD5, there exists 2 strings that hash to the same value. So, given any set of unique strings, you can't be sure 2 of these won't hash to the same value unless you analyse them in depth, or hash them all.

Upvotes: 1

Related Questions