Reputation: 4383
For uploading a file to a service, I was calculating the md5 based on the whole content of the file.
I was asked to do in a different way: the md5 of the file, and then also 3 more parts: 2% from the start of the file, 2% from 1/3 of the file, and 2% from 2/3, and 2% of the end of the file and then hash it the file's size and added the file size in bytes at the end.
Apparently this solves hash collisions between files. To me it seems like a waste of time, since your not increasing the size of md5. So for a huge large number of files, you're still gonna have, statistically, the same number of collisions.
Please help me understand the reasoning behind this.
EDIT: we are then hashing the resulting hashes.
Upvotes: 0
Views: 78
Reputation: 66783
A good cryptographically strong hashing algorithm is already designed with the goal to make it infeasible to intentionally find two different pieces of data with the same hash, let alone by accident. Therefore, just hashing the file is sufficient. Extra hashing of parts of the file is pointless.
This may seem unintuitive because obviously there must exist collisions if the length of the hash is shorter than the length of the data. However, it is not feasible to find these collisions because an MD5 hash is an unpredictable 128-bit number and the amount of possible 128-bit numbers (2^128) is mind boggling. If you could count at a rate of a trillion trillion per second, counting through all 128-bit numbers would still take (2^128 / 1e24) seconds ~ about 10 million years. This is probably a good lower limit to the amount of time that it would take to find a hash collision the brute force way without custom hardware.
That said, this is all assuming that there are no weaknesses in the hashing algorithm that allow you to do better than brute force. MD5 is broken in this regard, so you should not use it if you need to defend against attackers that would try to create collisions. It would be better to use a newer hashing algorithm like SHA-2 or SHA-3. (These also support even larger outputs such as 256 bits.)
Upvotes: 2
Reputation: 11515
Sounds like a dangerous practice, because you're re-hashing without factoring in a lot of data. The advantage however is that by running other hashes, you are effectivley winding up with a hash signature consisting of "more bits" - (i.e. you are getting three MD5 hashes as a result).
If you want to do this - and are in-fact okay with having more (larger) hash data to store/compare - you would be MUCH better advise to simply run a different hash function (other than MD5) that is either more secure, and/or uses a larger number of bits.
MD5 is an "older" algorithm and is known to have cryptographic weaknessess. I'd recommend one of the "SHA" algorithms - like SHA-256 or SHA-512. Advantages are that it is a stronger algorithm, you'd only have to has the data ONCE, and you'd get more bits than an MD5, yet since your running it once, it would be faster.
Note, that the possibility of hash collisions always exists. Even "high end" storage products which use hashes for detection will compare buffers to verify an exact match even if the comparison of two hashes matches.
Upvotes: 1