Reputation: 177
I have a repository where I store all my image files. I know that many of the images are duplicates, and I want to delete every redundant copy.
My idea is to generate a checksum for each image file and rename the file to its checksum, so that I can spot duplicates just by looking at the filenames. The problem is that I am not sure which checksum algorithm to use. For example, if I generate the checksums with MD5, can I trust that two identical checksums mean the files are exactly the same?
Upvotes: 1
Views: 4178
Reputation: 11911
To be really sure, you are best off with a two-step procedure. First, calculate a checksum for every file. If two checksums differ, you can be sure the files are not identical. If you find files with equal checksums, there is no way around a bit-by-bit comparison to be 100% sure they really are identical. This holds regardless of the hashing algorithm used.
What you gain is a massive time saving: a bit-by-bit comparison of every possible pair of files would take forever and a day, while comparing a handful of likely candidates is fairly easy.
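The two-step procedure above can be sketched in Python roughly as follows. This is only a minimal illustration, not a finished tool: the function names `file_digest` and `find_duplicates` are made up for this example, SHA-1 is an arbitrary choice of hash, and the sketch assumes every file under the given directory is readable.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: bucket files by checksum. Different checksums mean
    # the files are definitely different.
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)

    # Step 2: inside each bucket, confirm equality with a full
    # content comparison (shallow=False forces reading the bytes).
    groups = []
    for candidates in by_digest.values():
        confirmed = []
        for path in candidates:
            for group in confirmed:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                confirmed.append([path])
        groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

Calling `find_duplicates("images")` (with a hypothetical directory name) would give one list per set of identical files; keeping the first entry of each list and deleting the rest is one possible clean-up policy.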
Upvotes: 1
Reputation: 4842
The chances of getting the same checksum for two different files are extremely slim, but their absence can never be absolutely guaranteed (pigeonhole principle). As an indication of how slim: Git uses SHA-1 checksums to identify content in software projects, including the Linux kernel source, and this has never caused any known problems, so I would say that you are safe. If you are really paranoid, I would use SHA-1 instead of MD5 because it is slightly stronger.
Upvotes: 1
Reputation: 33501
Judging from the response to a similar question on the security forum (https://security.stackexchange.com/a/3145), you would need on the order of 2^64 messages before an accidental MD5 collision becomes likely. If your files are different and your collection is nowhere near that size, MD5 can be used safely.
Also, see the responses to a very similar question: "How many random elements before MD5 produces collisions?"
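To put a number on the collision estimate above, here is a small sketch using the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^b)), for n files and a b-bit hash. The function name `collision_probability` is made up for this example, and the formula assumes the hash output behaves like uniformly random values, so it covers accidental collisions only, not deliberately crafted ones.

```python
import math


def collision_probability(n_files, hash_bits=128):
    """Approximate chance that at least two of n_files share a digest.

    Uses the birthday approximation and assumes digests are
    uniformly random over 2**hash_bits values (128 bits for MD5).
    """
    pairs = n_files * (n_files - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for n in (10**6, 10**9, 2**64):
        print(f"{n} files: {collision_probability(n):.3e}")
```

For a collection of a million images the result is vanishingly small, and it only becomes substantial as the number of files approaches 2^64.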
Upvotes: 1