Aurélien Pierre

Reputation: 713

Compute similarity between images using a hash to detect near-duplicates

Let's say I have a massive SQL database indexing image files and the files themselves. Some files could be indexed twice or more, and some may have a corrupted copy or a more recent version indexed along with the original file.

Detecting exact duplicates can be done easily by computing the MD5 hash of the files, but is there a similar method to detect near-duplicates (files that are strongly similar without being exactly the same), so they can be removed from the database?
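For reference, a minimal sketch of the exact-duplicate pass I have in mind, assuming the indexed files are readable from disk (the function names are just for illustration):

```python
# Group files by MD5 digest; any group with more than one path is a set of exact duplicates.
import hashlib
from collections import defaultdict
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def exact_duplicates(paths):
    """Map digest -> list of paths, keeping only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```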

To be clear, I want to avoid at all costs things like computing the Euclidean distance for every pair of images in the database, which would just take ages.

Upvotes: 1

Views: 724

Answers (1)

Kornel

Reputation: 100190

For searches in SQL, the most convenient way is to compute a perceptual hash for each image, store it in a column, and use it to find potential duplicates. For slightly better results, you can compute several variants of the perceptual hash per image and count how many of them match (a Jaccard-style distance).
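As a sketch only, one way to wire this up with the third-party `imagehash` library (plus Pillow) and SQLite; the `images` table and its `phash`/`dhash`/`ahash` columns are assumptions made for the example:

```python
# Store several perceptual hash variants per image in SQL, then look up
# candidates by exact match on any variant and count how many variants agree.
import sqlite3
from PIL import Image
import imagehash

def hash_variants(path: str) -> dict:
    """Compute three perceptual hash variants of one image as hex strings."""
    img = Image.open(path)
    return {
        "phash": str(imagehash.phash(img)),         # DCT-based perceptual hash
        "dhash": str(imagehash.dhash(img)),         # gradient (difference) hash
        "ahash": str(imagehash.average_hash(img)),  # mean-based hash
    }

def candidate_duplicates(conn: sqlite3.Connection, path: str, min_matches: int = 2):
    """Return ids of stored images that agree with this image on at least
    `min_matches` of the three hash variants."""
    h = hash_variants(path)
    rows = conn.execute(
        "SELECT id, phash, dhash, ahash FROM images"
        " WHERE phash = ? OR dhash = ? OR ahash = ?",
        (h["phash"], h["dhash"], h["ahash"]),
    ).fetchall()
    candidates = []
    for image_id, phash, dhash, ahash in rows:
        matches = (phash == h["phash"]) + (dhash == h["dhash"]) + (ahash == h["ahash"])
        if matches >= min_matches:
            candidates.append(image_id)
    return candidates
```

Indexing each hash column keeps the lookup a plain equality search, which is what makes this cheap compared with pairwise comparisons.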

There are dedicated libraries for perceptual hashing. If you want to roll your own and don't need to detect variants that are cropped or rotated, a simple approach is to resize all images to 32×32, maximize contrast, posterize, and hash the resulting pixels.
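A hand-rolled version of that recipe might look like the following, assuming Pillow; the 32×32 size and 2-bit posterization are arbitrary choices, and only images that reduce to the same quantized thumbnail will hash identically:

```python
# Roll-your-own perceptual hash: shrink, maximize contrast, posterize,
# then hash the quantized pixels. Not robust to cropping or rotation.
import hashlib
from PIL import Image, ImageOps

def simple_perceptual_hash(path: str) -> str:
    img = Image.open(path).convert("L")   # grayscale
    img = img.resize((32, 32))            # normalize size
    img = ImageOps.autocontrast(img)      # stretch contrast to the full range
    img = ImageOps.posterize(img, 2)      # keep only 2 bits per pixel
    return hashlib.md5(img.tobytes()).hexdigest()  # hash the quantized pixel data
```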

If you don't need to use SQL, then it is feasible to quickly find duplicates based only on the Euclidean distance between pairs of images, even if you have millions of them, by using a vantage-point tree. It's roughly a binary tree that splits points into near and far halves, so each comparison roughly halves the number of images you need to search.
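A minimal sketch of that idea, for illustration only: it builds a vantage-point tree over (id, vector) pairs, where each vector could be a flattened small thumbnail, and prunes whole subtrees that cannot contain anything within the search radius.

```python
# Illustrative vantage-point tree for range search under Euclidean distance.
# Each item is an (id, vector) pair, e.g. an image id and a flattened 32x32 thumbnail.
import random
import numpy as np

class VPNode:
    def __init__(self, item, radius=0.0, inside=None, outside=None):
        self.item = item        # the vantage point: (id, vector)
        self.radius = radius    # median distance from the vantage point to the rest
        self.inside = inside    # subtree of items within the radius
        self.outside = outside  # subtree of items beyond the radius

def build(items):
    """Recursively build the tree by picking a random vantage point and
    splitting the remaining items at the median distance."""
    if not items:
        return None
    items = list(items)
    vp = items.pop(random.randrange(len(items)))
    if not items:
        return VPNode(vp)
    dists = [np.linalg.norm(vp[1] - it[1]) for it in items]
    radius = float(np.median(dists))
    inside = [it for it, d in zip(items, dists) if d <= radius]
    outside = [it for it, d in zip(items, dists) if d > radius]
    return VPNode(vp, radius, build(inside), build(outside))

def range_search(node, query, max_dist, results):
    """Collect every (id, distance) within max_dist of the query vector."""
    if node is None:
        return
    d = np.linalg.norm(node.item[1] - query)
    if d <= max_dist:
        results.append((node.item[0], d))
    # Triangle inequality: skip subtrees that cannot hold a close-enough point.
    if d - max_dist <= node.radius:
        range_search(node.inside, query, max_dist, results)
    if d + max_dist >= node.radius:
        range_search(node.outside, query, max_dist, results)
```

At scale you would want an iterative search (to avoid Python's recursion limit) or an existing library, but the pruning step is what avoids comparing the query against every image.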

Upvotes: 1
