Max Williams
Max Williams

Reputation: 32955

hashing function guaranteed to be unique?

In our app we're going to be handed png images along with a ~200 character byte array. I want to save the image with a filename corresponding to that bytearray, but not the bytearray itself, as i don't want 200 character filenames. So, what i thought was that i would save the bytearray into the database, and then MD5 it to get a short filename. When it comes time to display a particular image, i look up its bytearray, MD5 it, then look for that file.

So far so good. The problem is that potentially two different bytearrays could hash down to the same MD5. Then, one file would effectively overwrite another. Or could they? I guess my questions are

Upvotes: 3

Views: 23255

Answers (8)

luiner
luiner

Reputation: 31

As other said before. Hash doesnt give you what you need unless you are fine with risk of collision.

Database is helpful here. You get unique index for each 200 long string. No collisions here, and you need to set your 200 long names to be indexed, in that way it will use extra memory but it will sort it for you making search very very fast. You get unique id which can be easily used for filenames.

Upvotes: 2

flying-high
flying-high

Reputation: 219

I have'nt worked much on hashing algorithms but as per my understanding there is always a chance of collison in hashing algorithm i.e. two differnce object may be hashed to same hash value but it is guaranteed that every time a object will be hashed to same hash value. There are other techniques that may be used for this , like linear probing.

Upvotes: 0

Shiplu Mokaddim
Shiplu Mokaddim

Reputation: 57690

The probability of two hashes will likely to collide depends on the hash size. MD5 produces 128-bit hash. So for 2128+1 number of hashes there will be at least one collision.

This number is 2160+1 for SHA1 and 2512+1 for SHA512.

Here this rule applies. The more the output bits the more uniqueness and more computation. So there is a trade off. What you have to do is to choose an optimal one.

Upvotes: 4

Widor
Widor

Reputation: 13285

When hashing (as opposed to encrypting), you're reducing the information space of the data being hashed, so there's always a chance of a collision.

The best you can hope for in a hash function is that all hashes are evenly distributed in the hash space and your hash output is large enough to provide your "once-per-10-ages-of-the-universe sort of deal" as you put it!

So whether a hash is "good enough" for you depends on the consequences of a collision. You could always add a unique id to a checksum/hash to get the best of both worlds.

Upvotes: 6

Stefan
Stefan

Reputation: 114248

Why don't you use a unique ID from your database?

Upvotes: 4

Thilo
Thilo

Reputation: 262842

Could two ~200 char bytearrays MD5-hash down to the same string?

Considering that there are more 200 byte strings than 32 byte strings (MD5 digests), that is guaranteed to be the case.

All hash functions have that problem, but some are more robust than MD5. Try SHA-1. git is using it for the same purpose.

Upvotes: 3

JohnB
JohnB

Reputation: 13743

It's logically impossible to get a 32 byte code from a 200 byte source which is unique among all possible 200 byte sources, since you can store more information in 200 bytes than in 32 bytes.

They only exception would be that the information stored in these 200 bytes would also fit into 32 bytes, in which case your source date format would be extremely inefficient and space-wasting.

Upvotes: 8

Jonathan Petitcolas
Jonathan Petitcolas

Reputation: 4584

It may happen that two MD5 hashes collides (are the same). In 1996, a flaw was found in MD5 algorithm, and cryptanalysts advised to switch to SHA-1 hashing algorithm.

So, I will advise you to switch to SHA-1 (40 characters). But do not worry: I doubt that your two pictures will get the same hash. I think you can assume this risk in your application.

Upvotes: 2

Related Questions