Reputation: 32955
In our app we're going to be handed png images along with a ~200 character byte array. I want to save the image with a filename corresponding to that bytearray, but not the bytearray itself, as i don't want 200 character filenames. So, what i thought was that i would save the bytearray into the database, and then MD5 it to get a short filename. When it comes time to display a particular image, i look up its bytearray, MD5 it, then look for that file.
So far so good. The problem is that potentially two different bytearrays could hash down to the same MD5. Then, one file would effectively overwrite another. Or could they? I guess my questions are
Upvotes: 3
Views: 23255
Reputation: 31
As other said before. Hash doesnt give you what you need unless you are fine with risk of collision.
Database is helpful here. You get unique index for each 200 long string. No collisions here, and you need to set your 200 long names to be indexed, in that way it will use extra memory but it will sort it for you making search very very fast. You get unique id which can be easily used for filenames.
Upvotes: 2
Reputation: 219
I have'nt worked much on hashing algorithms but as per my understanding there is always a chance of collison in hashing algorithm i.e. two differnce object may be hashed to same hash value but it is guaranteed that every time a object will be hashed to same hash value. There are other techniques that may be used for this , like linear probing.
Upvotes: 0
Reputation: 57690
The probability of two hashes will likely to collide depends on the hash size. MD5 produces 128-bit hash. So for 2128+1 number of hashes there will be at least one collision.
This number is 2160+1 for SHA1 and 2512+1 for SHA512.
Here this rule applies. The more the output bits the more uniqueness and more computation. So there is a trade off. What you have to do is to choose an optimal one.
Upvotes: 4
Reputation: 13285
When hashing (as opposed to encrypting), you're reducing the information space of the data being hashed, so there's always a chance of a collision.
The best you can hope for in a hash function is that all hashes are evenly distributed in the hash space and your hash output is large enough to provide your "once-per-10-ages-of-the-universe sort of deal" as you put it!
So whether a hash is "good enough" for you depends on the consequences of a collision. You could always add a unique id to a checksum/hash to get the best of both worlds.
Upvotes: 6
Reputation: 262842
Could two ~200 char bytearrays MD5-hash down to the same string?
Considering that there are more 200 byte strings than 32 byte strings (MD5 digests), that is guaranteed to be the case.
All hash functions have that problem, but some are more robust than MD5. Try SHA-1. git is using it for the same purpose.
Upvotes: 3
Reputation: 13743
It's logically impossible to get a 32 byte code from a 200 byte source which is unique among all possible 200 byte sources, since you can store more information in 200 bytes than in 32 bytes.
They only exception would be that the information stored in these 200 bytes would also fit into 32 bytes, in which case your source date format would be extremely inefficient and space-wasting.
Upvotes: 8
Reputation: 4584
It may happen that two MD5 hashes collides (are the same). In 1996, a flaw was found in MD5 algorithm, and cryptanalysts advised to switch to SHA-1 hashing algorithm.
So, I will advise you to switch to SHA-1 (40 characters). But do not worry: I doubt that your two pictures will get the same hash. I think you can assume this risk in your application.
Upvotes: 2