Reputation: 3
I'm creating tools in Python 3.8.1 for organizing my images. One of those tools detects and separates similar images, so I need an algorithm to compare each image with every other image in a folder. More precisely, I need a BETTER algorithm for doing that.
My approach to this problem is as follows:
Let's take an example of 6 images, named 1, 2, 3, 4, 5, and 6. These are all the possible comparisons between each image and every other image in the folder:
1 -> 1 2 -> 1 3 -> 1 4 -> 1 5 -> 1 6 -> 1
1 -> 2 2 -> 2 3 -> 2 4 -> 2 5 -> 2 6 -> 2
1 -> 3 2 -> 3 3 -> 3 4 -> 3 5 -> 3 6 -> 3
1 -> 4 2 -> 4 3 -> 4 4 -> 4 5 -> 4 6 -> 4
1 -> 5 2 -> 5 3 -> 5 4 -> 5 5 -> 5 6 -> 5
1 -> 6 2 -> 6 3 -> 6 4 -> 6 5 -> 6 6 -> 6
There are 6 * 6 = 36 comparisons.
Next, since we are comparing images to find similar ones, it is logical to exclude comparing an image with itself, so we remove 1 -> 1, 2 -> 2, and so on. We also need to exclude comparing the same pair twice: once we have compared 1 -> 2, there is no need to compare 2 -> 1. Logically speaking, comparing "me" to "you" is the same as comparing "you" to "me"; if "I" really am different from "you", there is no need to check again.
Thus, the remaining comparisons are:
1 -> 2
1 -> 3 2 -> 3
1 -> 4 2 -> 4 3 -> 4
1 -> 5 2 -> 5 3 -> 5 4 -> 5
1 -> 6 2 -> 6 3 -> 6 4 -> 6 5 -> 6
This reduces the total to 5 + 4 + 3 + 2 + 1 = 15 comparisons, less than half of the original 36.
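In general, for n images this is n * (n - 1) / 2 unordered pairs ("n choose 2"), which can be checked quickly:

```python
def pair_count(n):
    # number of unordered pairs among n images: n choose 2
    return n * (n - 1) // 2

print(pair_count(6))    # 15 comparisons for 6 images
print(pair_count(750))  # 280875 comparisons for 750 images
```

This grows quadratically, which is why the comparison count explodes for larger folders.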
I implemented this with a method that takes a list of all the images in the folder and returns a list of image pairs according to the logic above. Here's the method:
def get_cmpr_pairs_list( self ):
    cmprPairsList = []
    for i in range(0, self.imgCount):
        for j in range(i+1, self.imgCount):
            cmprPairsList.append( [self.filenames[i], self.filenames[j]] )
    return cmprPairsList
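As a side note, the standard library already implements this "each unordered pair exactly once" logic in itertools.combinations, so the method above could be replaced by a one-liner. A minimal sketch with a plain list of filenames standing in for self.filenames:

```python
from itertools import combinations

filenames = ["1.jpg", "2.jpg", "3.jpg", "4.jpg"]

# combinations(iterable, 2) yields each unordered pair exactly once,
# in the same order as the nested-loop version above
pairs = [list(p) for p in combinations(filenames, 2)]
print(len(pairs))  # 6 pairs for 4 images (4 * 3 / 2)
print(pairs[0])    # ['1.jpg', '2.jpg']
```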
Since that alone isn't fast enough, I use the multiprocessing module to split the task of comparing these image pairs across all my CPU cores (an 8-core CPU). This is the method I created to compare all the images within the folder:
def find_similars_all( self ):
    print("Finding similar images...\n")
    similarImg = [] # a list that stores similar images
    cmprPairsList = self.get_cmpr_pairs_list() # get the comparison pairs list
    self.cmprPairsCount = len(cmprPairsList)
    # create a multiprocess pool
    with multiprocessing.get_context("spawn").Pool() as p :
        print("Total images to compare : {} images\n".format(self.cmprPairsCount))
        # find all similar images and add them to similarImg
        similarImg = p.map(self.compare_img, cmprPairsList)
        # remove all None from the list
        similarImg = list(filter(None, similarImg))
        # melt the list of lists into a single list
        # from [[], [], ...] into [ ... ]
        similarImg = itertools.chain.from_iterable(similarImg)
        # remove duplicates
        similarImg = list(dict.fromkeys(similarImg))
    # if we have found some similar images...
    if similarImg:
        self.move_filenames(similarImg, "Similars")
    else :
        print("\nNo similar images found.")
    print("\nDone searching for similar images!")
    self.update_filenames()
Finally, this is the method that each process runs to compare one pair of images. It returns the pair if the two images are similar:
def compare_img( self, imagePair ):
    # get the pair of image filenames that we want to compare
    imgA, imgB = imagePair[0], imagePair[1]
    imgA = os.path.join(self.dirname, imgA)
    imgB = os.path.join(self.dirname, imgB)
    # ------ just making a counter for the image comparisons ------
    global counter
    counter.increment()
    print("{}/{} images compared...".format(counter.value(), self.cmprPairsCount), end="\r")
    # ------ just making a counter for the image comparisons ------
    # ---- comparison algorithm starts here ----
    # add the threshold
    threshold = 1 - self.similarity_percentage/100
    diff_limit = int(threshold*(self.hash_size**2))
    # create the average hash of each image
    with Image.open(imgA) as img:
        hash1 = imagehash.average_hash(img, self.hash_size).hash
    with Image.open(imgB) as img:
        hash2 = imagehash.average_hash(img, self.hash_size).hash
    result = np.count_nonzero(hash1 != hash2) <= diff_limit
    # ---- comparison algorithm stops here ----
    # this part concludes whether the two images are similar or not
    if result:
        print("{} image found {}% similar to {}".format(imgA, self.similarity_percentage, imgB))
        return imagePair
I tested this algorithm with 750 images, which yields 280,875 comparisons in total and takes somewhere between 30 minutes and an hour. My CPU usage is at 100% the whole time, which makes me think this thing could double as a CPU stress test lol. I also ruled out disk I/O as a bottleneck by storing the image files on a RAM disk, so Python has no trouble opening the files. I think the only bottleneck is the total number of comparisons.
Back to my main question: is there some way to improve this? For example, reducing the number of comparisons with some special algorithm, comparing the images faster, or anything else that helps. I am most interested in other algorithms that can compare each image with the other images in the folder.
I will really appreciate the help. Thanks!
Upvotes: 0
Views: 826
Reputation: 92
Do not calculate the hashes inside each comparison. Open every file once, calculate its hash, and store it. Then compare the stored hashes for all combinations.
Upvotes: 2