Reputation: 85036
I am creating a service that accepts a file as input and then performs some processing on that file. I would like to create a checksum of the file and then check a database to see if that file has already been processed and then pull the data from there rather than reprocessing it.
I have a few questions about this process.
1) Do I need to worry about checksum collisions? In other words, could two different files ever return the same checksum?
2) I'm planning on using MD5 to calculate the hash - are there any faster ways to do this? Are there algorithms I should consider for other reasons?
Upvotes: 1
Views: 2193
Reputation: 12075
1) Do I need to worry about checksum collisions? I'm planning on using MD5 to calculate the hash
There is a difference between a checksum (e.g. CRC32) and a cryptographic hash. A cryptographic hash is designed to be collision resistant.
This means a cryptographic hash is probably the best option you have. The probability of an accidental collision is very low, negligible in practice, though mathematically still greater than zero.
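To put a rough number on "very low", here is a minimal sketch of the standard birthday-bound approximation. It assumes hash outputs are uniformly distributed and that collisions are accidental rather than deliberately constructed. The function name `collision_probability` is only illustrative.

```python
import math

def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_files distinct inputs
    share a digest, assuming uniformly distributed hash_bits-bit outputs.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)

if __name__ == "__main__":
    for name, bits in [("CRC32", 32), ("MD5", 128), ("SHA-256", 256)]:
        print(name, collision_probability(10**9, bits))
```

Under these assumptions, a 32-bit checksum is more likely than not to collide well before a hundred thousand files, while a 128-bit or 256-bit digest stays negligible even for a billion files.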
I'm planning on using MD5 to calculate the hash - are there any faster ways to do this? Are there algorithms I should consider for other reasons?
MD5 is fast, but no longer secure. It has been broken, and there are fast methods for producing multiple inputs that result in the same hash output. The standard hash used today is SHA-256. (As long as you use MD5 only as a checksum and are not concerned about intentional collisions, you may be OK. Even so, you should avoid crypto primitives that are considered obsolete.)
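As a rough illustration of the workflow in the question, the sketch below hashes a file with SHA-256 in chunks and uses the digest as a lookup key. The names `file_digest` and `process_once` are hypothetical, and the plain dict `cache` only stands in for whatever database the service actually uses.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks so large
    files do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def process_once(path, cache, process):
    """Run process(path) only if no file with the same digest has been
    seen before; otherwise return the stored result.

    cache is any dict-like store (an assumption standing in for the
    database lookup described in the question).
    """
    key = file_digest(path)
    if key not in cache:
        cache[key] = process(path)
    return cache[key]
```

Passing `algorithm="md5"` would reproduce the original plan if only accidental collisions matter; the default here follows the SHA-256 recommendation above.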
Upvotes: 4