How to find the nearest neighbor of the Hamming distance in a SQL database efficiently?

Question

On my server, I would like to add a feature to find duplicate images. That way when it downloads an image from somewhere else, it can determine if the same image in a different size already exists, thus saving storage space. This task can be accomplished by a perceptual hash, which calculates a hash value for each image, and whenever the hash value of a particular image is less than a threshold with it, it assumes that the two images are the same. Unfortunately, the number of images is so large that it would be unacceptable to iterate all existing hashes with O(n) complexity every time there is a new image. Although there are data structures like the Vantage-point tree that speed up neighbor lookups, I can't load all of this data into memory. So I would like to ask what is the best way to find the nearest neighbor of the Hamming distance in the database with less complexity.

I looked up the results on the internet. They are either for a smaller amount of data, using a data structure for the lookup, or a query in MySQL shaped like "where bitcount(x ^ hash) <=5" - it will be linear complexity due to the absence of indexes. The best result I've found is https://github.com/KDJDEV/imagehash-reverse-image-search-tutorial who seems to offer a PostgreSQL plugin to support bk-tree indexes. But this plugin only supports 64-bit hashes, which is not enough for me to extend it to my program.

How to find the nearest neighbor of the Hamming distance in a SQL database efficiently?

Answers (0)

Related Questions