Reputation: 31
I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently but wasn’t able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Upvotes: 3
Views: 408
Reputation: 1675
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
Upvotes: 2