Reputation: 3
I'm creating tools in Python 3.8.1 for organizing my images. One of those tools detects and separates similar images, so I need an algorithm to compare each image with every other image in a folder. More precisely, I need a BETTER algorithm for doing that.
My approach to this problem is as follows:
Let's take an example of 6 images, named 1, 2, 3, 4, 5, and 6. These are all the possible comparisons between each image and every other image in the folder:
1 -> 1 2 -> 1 3 -> 1 4 -> 1 5 -> 1 6 -> 1
1 -> 2 2 -> 2 3 -> 2 4 -> 2 5 -> 2 6 -> 2
1 -> 3 2 -> 3 3 -> 3 4 -> 3 5 -> 3 6 -> 3
1 -> 4 2 -> 4 3 -> 4 4 -> 4 5 -> 4 6 -> 4
1 -> 5 2 -> 5 3 -> 5 4 -> 5 5 -> 5 6 -> 5
1 -> 6 2 -> 6 3 -> 6 4 -> 6 5 -> 6 6 -> 6
There are 6 * 6 = 36 comparisons.
Next, since we are comparing images to find similar ones, it is logical to exclude comparing an image with itself, so we remove 1 -> 1, 2 -> 2, and so on. We also need to exclude comparing the same pair twice: once we have compared 1 -> 2, there is no need to compare 2 -> 1. Logically speaking, comparing "me" to "you" is the same as comparing "you" to "me"; if "I" really am different from "you", there is no need to check again.
Thus, the remaining comparisons are:
1 -> 2
1 -> 3 2 -> 3
1 -> 4 2 -> 4 3 -> 4
1 -> 5 2 -> 5 3 -> 5 4 -> 5
1 -> 6 2 -> 6 3 -> 6 4 -> 6 5 -> 6
This reduces the total to 5 + 4 + 3 + 2 + 1 = 15 comparisons, less than half of the original 36.
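In general, for n images this is n * (n - 1) / 2 unordered pairs ("n choose 2"), which can be checked quickly:

```python
def pair_count(n):
    # number of unordered pairs among n images: n choose 2
    return n * (n - 1) // 2

print(pair_count(6))    # 15 comparisons for 6 images
print(pair_count(750))  # 280875 comparisons for 750 images
```

This grows quadratically, which is why the comparison count explodes for larger folders.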
I implemented this with a method that takes a list of all the images in the folder and returns a list of image pairs according to the logic above. Here's the method:
def get_cmpr_pairs_list( self ):
    cmprPairsList = []
    for i in range(0, self.imgCount):
        for j in range(i+1, self.imgCount):
            cmprPairsList.append( [self.filenames[i], self.filenames[j]] )
    return cmprPairsList
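As a side note, the standard library already implements this "each unordered pair exactly once" logic in itertools.combinations, so the method above could be replaced by a one-liner. A minimal sketch with a plain list of filenames standing in for self.filenames:

```python
from itertools import combinations

filenames = ["1.jpg", "2.jpg", "3.jpg", "4.jpg"]

# combinations(iterable, 2) yields each unordered pair exactly once,
# in the same order as the nested-loop version above
pairs = [list(p) for p in combinations(filenames, 2)]
print(len(pairs))  # 6 pairs for 4 images (4 * 3 / 2)
print(pairs[0])    # ['1.jpg', '2.jpg']
```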
Since that alone isn't fast enough, I use the multiprocessing module to split the task of comparing these image pairs across all my CPU cores (an 8-core CPU). This is the method I created to compare all the images within the folder:
def find_similars_all( self ):
    print("Finding similar images...\n")
    similarImg = [] # a list that stores similar images
    cmprPairsList = self.get_cmpr_pairs_list() # get the comparison pairs list
    self.cmprPairsCount = len(cmprPairsList)
    # create a multiprocess pool
    with multiprocessing.get_context("spawn").Pool() as p :
        print("Total images to compare : {} images\n".format(self.cmprPairsCount))
        # find all similar images and add them to similarImg
        similarImg = p.map(self.compare_img, cmprPairsList)
        # remove all None from the list
        similarImg = list(filter(None, similarImg))
        # melt the list of lists into a single list
        # from [[], [], ...] into [ ... ]
        similarImg = itertools.chain.from_iterable(similarImg)
        # remove duplicates
        similarImg = list(dict.fromkeys(similarImg))
    # if we have found some similar images...
    if similarImg:
        self.move_filenames(similarImg, "Similars")
    else :
        print("\nNo similar images found.")
    print("\nDone searching for similar images!")
    self.update_filenames()
Finally, this is the method that each process runs to compare one pair of images. It returns the pair if the two images are similar:
def compare_img( self, imagePair ):
    # get the pair of image filenames that we want to compare
    imgA, imgB = imagePair[0], imagePair[1]
    imgA = os.path.join(self.dirname, imgA)
    imgB = os.path.join(self.dirname, imgB)
    # ------ just making a counter for the image comparisons ------
    global counter
    counter.increment()
    print("{}/{} images compared...".format(counter.value(), self.cmprPairsCount), end="\r")
    # ------ just making a counter for the image comparisons ------
    # ---- comparison algorithm starts here ----
    # add the threshold
    threshold = 1 - self.similarity_percentage/100
    diff_limit = int(threshold*(self.hash_size**2))
    # create the average hash of each image
    with Image.open(imgA) as img:
        hash1 = imagehash.average_hash(img, self.hash_size).hash
    with Image.open(imgB) as img:
        hash2 = imagehash.average_hash(img, self.hash_size).hash
    result = np.count_nonzero(hash1 != hash2) <= diff_limit
    # ---- comparison algorithm stops here ----
    # this part concludes whether the two images are similar or not
    if result:
        print("{} image found {}% similar to {}".format(imgA, self.similarity_percentage, imgB))
        return imagePair
I tested this algorithm with 750 images, which yields 280,875 comparisons in total and takes somewhere between 30 minutes and an hour. My CPU usage is at 100% the whole time, which makes me think this thing could double as a CPU stress test lol. I also ruled out disk I/O as a bottleneck by storing the image files on a RAM disk, so Python has no trouble opening the files. I think the only bottleneck is the total number of comparisons.
Back to my main question: is there some way to improve this? For example, reducing the number of comparisons with some special algorithm, comparing the images faster, or anything else that helps. I am most interested in other algorithms that can compare each image with the other images in the folder.
I will really appreciate the help. Thanks!
Upvotes: 0
Views: 826
Reputation: 92
Do not calculate the hashes inside each comparison. Open every file once, calculate its hash, and store it. Then compare the stored hashes for all combinations.
Upvotes: 2