EllipticalInitial

Reputation: 1270

Hashing 1000 Image Files as Quickly as Possible (2000x2000+ Resolution) (Python)

I have a folder with several thousand RGB 8-bit-per-channel image files on my computer, anywhere between 2000x2000 and 8000x8000 in resolution (so most of them are extremely large).

I would like to store some small value, such as a hash, for each image so that I have something cheap to compare against in the future to see whether any image files have changed. There are three primary requirements for this value:

  1. The calculation of this value needs to be fast
  2. The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
  3. Collisions should basically never happen.

There are a lot of ways I could go about this, such as SHA-1, MD5, etc., but the real goal here is speed: I just want an extremely quick way to identify whether ANY change at all has been made to an image.

How would you achieve this in Python? Is there a particular hash algorithm you recommend for speed? Or can you devise a different way to achieve my three goals altogether?

Upvotes: -1

Views: 944

Answers (1)

user3790180

Reputation: 443

  1. The calculation of this value needs to be fast
  2. The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
  3. Collisions should basically never happen.
  1. Hash calculation for large files takes time, and the cost varies by algorithm, so if it needs to be fast, choose an efficient hashing algorithm for your task; you can find benchmarks comparing them, or time them yourself (see the timing sketch after this list). You can also optimize your algorithm by checking something cheaper before computing the hash.
  2. If you decide to use hashing, this requirement is covered: the hash value will change even if only a small part of the image has changed.
  3. Collisions can happen (very rarely, but not never). This is the nature of hash algorithms.
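To the 1st point, one quick way to compare algorithm speeds on your own machine is to time each hashlib algorithm on a representative file. A minimal sketch, assuming Python 3.8+ and the standard library; "example.png" is a placeholder path and the candidate algorithms shown are just common choices:

```python
import hashlib
import time

def hash_file(path, algorithm, chunk_size=1 << 20):
    """Hash one file with the named hashlib algorithm, reading in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Time a few candidates on one representative image to pick the fastest.
for name in ("md5", "sha1", "sha256", "blake2b"):
    start = time.perf_counter()
    digest = hash_file("example.png", name)
    print(f"{name:8s} {time.perf_counter() - start:.3f}s  {digest[:16]}")
```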

Example for the 1st point (optimizing the algorithm), sketched in code after this list:

  • Check the file size.
  • If sizes are equal, check the CRC.
  • If CRCs are equal, then calculate and compare full hashes. (Both of the last two steps require a full pass over the file.)
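A minimal sketch of that cascade, assuming Python 3.8+ and that you keep a small record per file from the previous scan; the `stored` dict and its field names are illustrative, as is the choice of BLAKE2b for the full hash:

```python
import hashlib
import os
import zlib

CHUNK = 1 << 20  # read 1 MiB at a time

def crc32_of_file(path):
    """CRC-32 of the whole file, computed incrementally."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            crc = zlib.crc32(chunk, crc)
    return crc

def blake2b_of_file(path):
    """Full hash of the file; the most expensive check, so done last."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def has_changed(path, stored):
    """Compare a file against its stored record, cheapest check first.
    `stored` is a dict like {"size": ..., "crc": ..., "digest": ...}."""
    if os.path.getsize(path) != stored["size"]:
        return True
    if crc32_of_file(path) != stored["crc"]:
        return True
    return blake2b_of_file(path) != stored["digest"]
```

The size check costs only a stat call, so it short-circuits most changed files before any bytes are read.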

Optionally, before hashing entire files, you can calculate and compare partial hashes (over only part of each file) instead of the whole file, as in the sketch below.
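A sketch of that partial check, under the assumption that sampling the first and last 1 MiB is a useful prefilter (BLAKE2b is again just one choice of hash). Note that a mismatch proves the file changed, but a match says nothing about the middle, so it must still be confirmed with a full hash:

```python
import hashlib
import os

def partial_digest(path, sample=1 << 20):
    """Hash only the first and last `sample` bytes of the file.
    A differing digest proves a change; a matching digest must be
    confirmed with a full hash, since the middle is never read."""
    h = hashlib.blake2b()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        h.update(f.read(sample))          # leading bytes
        if size > 2 * sample:
            f.seek(-sample, os.SEEK_END)  # jump to the trailing bytes
            h.update(f.read(sample))
    return h.hexdigest()
```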

If most of your files are likely to differ, then checking cheaper properties before hashing will probably be faster overall.

But if most of your files are identical, the steps before hashing will just consume extra time, because you will end up calculating the full hash for most files anyway.

So try to implement the most efficient combination of checks for your context.

Upvotes: 1
