Reputation: 4190
I have a continuously growing set of files and have to ensure that there are no duplicates. By duplicate I mean identical at byte level.
The files are collected from various systems, some of which also provide hash codes for the files (but some don't). Some files may exist on multiple systems but should be imported only once.
I want to avoid unnecessary file transfers, so I thought I would just compare hash codes before actually copying. However, as I said, some of these systems don't provide a hash code, and some use MD5, which I read isn't secure anymore.
My questions:
Upvotes: 1
Views: 499
Reputation: 3402
Firstly, the only way to conclusively prove two files are identical is to compare them bit for bit. So if you need absolute certainty, you cannot avoid transferring the files. Unless you can make certain assumptions about the files, that's just a mathematical truth.
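To illustrate what a bit-for-bit comparison involves once both files are available locally, here is a minimal Python sketch. The function name `files_identical` and the chunk size are my own choices for illustration, not something from your setup.

```python
import os

def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True only if both files contain exactly the same bytes."""
    # Files of different sizes can never be identical, so check that first.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached the end at the same time
                return True
```

Reading in chunks keeps memory use constant even for very large files, and the size check lets most non-duplicates be rejected without reading any content.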
And then we have hash functions. What a hash function tries to do is calculate some value which is highly likely to be different when the files are different. How likely depends on the actual function: a really stupid hash function might have a chance of one in ten of producing the same hash for different files, while for a good hash function those chances are insanely small. For md5, the chance that two given different files accidentally end up with the same hash is one in 2^128. I'm guessing that's good enough for your system, so you can safely assume the files are the same when the hash is the same.
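To put a number on "insanely small" for a whole collection rather than a single pair, a rough estimate is the number of file pairs divided by the number of possible hash values. The sketch below computes that upper bound; it assumes the hash behaves like a uniformly random function and that nobody is crafting files on purpose, and the figure of one billion files is just an illustrative guess.

```python
from fractions import Fraction

def accidental_collision_bound(num_files, hash_bits):
    """Upper bound on the chance that any two of num_files distinct files
    share a hash, assuming hash values are uniformly random."""
    pairs = num_files * (num_files - 1) // 2  # number of distinct file pairs
    return Fraction(pairs, 2 ** hash_bits)

# Hypothetical collection of one billion files hashed with a 128-bit hash.
bound = accidental_collision_bound(10**9, 128)
print(float(bound))  # on the order of 1e-21
```

Even at that scale, an accidental md5 collision is far less likely than a hardware error corrupting a file during the transfer.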
Now for a secure hash, and md5 being broken. Hash functions are not just used as a quick way to check if things are the same. They are also used in cryptographic systems to verify that things are the same, even when someone is actively trying to fool you. It's only in that sense that md5 is broken: it is possible to deliberately construct two different files with the same md5 hash relatively quickly. If you fear someone might intentionally craft such colliding files to trick you into skipping one of them, you shouldn't rely on md5. But that doesn't seem to be the case here. If no one is deliberately messing with the files, md5 still works fine.
So to your first question, theoretically no, but realistically yes.
To the second question, you should calculate all the different hashes that might be used, for each file you have stored locally. E.g. calculate the md5 hash and the sha1 hash (or whatever hashes are being used on the remote systems). That way you will always have the correct type of hash to check against for each file you already have.
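Calculating several hashes does not require reading the file several times. As a sketch, assuming md5 and sha1 are the algorithms in use (adjust the list to whatever the remote systems actually provide), Python's `hashlib` can feed each chunk to all hashers in a single pass. The function name `file_digests` is made up for this example.

```python
import hashlib

def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Read the file once and return {algorithm name: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Storing the resulting dictionary alongside each imported file means a later lookup works no matter which of those hash types a remote system reports.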
For the files which don't come with a hash, you can't do anything to avoid transferring them. Until you do, there is nothing you know about those files. Once you have transferred them, you can still calculate a hash yourself, so that later on you can quickly check whether you already have that file.
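Putting the decision together, a small in-memory index could look like the sketch below. The class `HashIndex` and its method names are hypothetical, and a real system would probably keep this in a database instead of a dictionary. The rule it implements is the one described above: transfer whenever the remote system gives no hash or the hash is unknown, and skip only on a match.

```python
class HashIndex:
    """Maps (algorithm, hex digest) to the local path of an imported file."""

    def __init__(self):
        self._known = {}

    def add(self, path, digests):
        # digests is a dict such as {"md5": "...", "sha1": "..."}.
        for algorithm, digest in digests.items():
            self._known[(algorithm, digest.lower())] = path

    def needs_transfer(self, algorithm=None, digest=None):
        # Without a remote hash nothing is known, so the file must be fetched.
        if algorithm is None or digest is None:
            return True
        return (algorithm, digest.lower()) not in self._known


index = HashIndex()
index.add("/data/report.pdf", {"md5": "5d41402abc4b2a76b9719d911017c592"})
print(index.needs_transfer("md5", "5D41402ABC4B2A76B9719D911017C592"))  # False
print(index.needs_transfer())  # True, no hash was provided
```

After a file without a remote hash has been downloaded, its locally computed hashes can be checked against the index before it is kept, and then added so that later copies are recognised.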
Upvotes: 3