Reputation: 170489
I use an MD5 hash to identify files of unknown origin. There is no attacker here, so I don't care that MD5 has been broken and that collisions can be generated deliberately.
My problem is that I need to provide logging so that various problems are easier to diagnose. If I log every hash as a full hex string, it is too long, inconvenient and ugly, so I'd like to shorten the hash string.
Now I know that just taking a small part of a GUID is a very bad idea: GUIDs are designed to be unique as a whole, but their parts are not.
Is the same true for MD5? Can I take, say, the first 4 bytes of the MD5 and assume that the collision probability goes up only because of the reduced number of bytes compared to the original hash?
Upvotes: 9
Views: 2937
Reputation: 78934
Another way to shorten the hash is to encode it in something more compact than hex, such as Base64 or some variant thereof.
Even if you're determined to take only 4 characters, 4 characters of Base64 carry 24 bits, whereas 4 hex characters carry only 16.
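To illustrate the difference, here is a minimal Python sketch that builds a short id from the same MD5 digest in both encodings. The function names `short_id_hex` and `short_id_b64` are made up for this example, and the URL-safe Base64 alphabet is just one possible choice (it avoids `/` and `+`, which can be awkward in log lines and file names).

```python
import base64
import hashlib


def short_id_hex(data: bytes, chars: int = 4) -> str:
    """First `chars` hex characters of the MD5 digest (4 bits per character)."""
    return hashlib.md5(data).hexdigest()[:chars]


def short_id_b64(data: bytes, chars: int = 4) -> str:
    """First `chars` Base64 characters of the MD5 digest (6 bits per character)."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    # URL-safe alphabet; strip the trailing '=' padding before truncating.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return encoded[:chars]


if __name__ == "__main__":
    payload = b"example file contents"
    print("hex   :", short_id_hex(payload))   # 4 chars = 16 bits
    print("base64:", short_id_b64(payload))   # 4 chars = 24 bits
```

Both ids come from the leading bits of the same digest, so the Base64 form is simply a denser spelling of a longer prefix. Note that Base64 is case-sensitive, which matters if the logs are ever searched case-insensitively.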
Upvotes: 1
Reputation: 52539
The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox, though:
http://en.wikipedia.org/wiki/Birthday_paradox
Four bytes give only 2^32 (about 4.3 billion) possible ids, and the risk that some two files share an id grows rapidly as you add more files. With 50,000 files there's roughly a 25% chance that you'll get an id collision.
EDIT: OK, I just read the link to your other question, and with 100,000 files the chance of a collision is roughly 70%.
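These figures can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / 2N), where n is the number of files and N = 2^32 is the number of possible 4-byte ids. The sketch below assumes the truncated MD5 bits behave like uniformly random values; the function name `collision_probability` is chosen only for this example.

```python
import math


def collision_probability(n_files: int, id_bits: int = 32) -> float:
    """Approximate chance that at least two of n_files share a truncated id.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)) with N = 2**id_bits,
    assuming the ids are uniformly distributed.
    """
    space = 2 ** id_bits
    exponent = n_files * (n_files - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (1_000, 10_000, 50_000, 100_000):
        print(f"{n:>7} files: {collision_probability(n):.2%}")
```

Passing a larger `id_bits` (for example 64 for the first 8 bytes) shows how quickly the risk drops when a few more bytes are kept.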
Upvotes: 8
Reputation: 72655
Here is a related question you may want to refer to:
What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?
Upvotes: 1