Travis

Reputation: 13

Finding image similarities in a folder of thousands

I've cobbled together some code (thanks, Stack Overflow users!) that checks for similarities between images using imagehash, but now I'm having trouble checking thousands of images (roughly 16,000). Is there anything I could improve in the code (or a different route entirely) to find matches more accurately and/or decrease the time required? Thanks!

I first changed the list I was building to an itertools.combinations iterator, so the loop only compares each unique pair of images once.
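(For reference, a minimal demonstration of what itertools.combinations yields; the filenames here are made up:)

import itertools

# Each unordered pair appears exactly once, with no (x, x) self-pairs:
print(list(itertools.combinations(['a.jpg', 'b.jpg', 'c.jpg'], 2)))
# -> [('a.jpg', 'b.jpg'), ('a.jpg', 'c.jpg'), ('b.jpg', 'c.jpg')]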

import csv
import itertools
import os

import imagehash
from PIL import Image

os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')

duplicates = []
dup = []

for f1, f2 in itertools.combinations(dirloc, 2):
    # Honestly not sure which hash method to use, so I went with dhash.
    dhash1 = imagehash.dhash(Image.open(f1))
    dhash2 = imagehash.dhash(Image.open(f2))
    hashdif = dhash1 - dhash2

    if hashdif < 5:  # May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

    # Setting up a CSV file with the similar images to review before deleting
    with open("duplicates.csv", "w") as myfile:
        wr = csv.writer(myfile)
        wr.writerows(zip(duplicates, dup))

Currently, this code may take days to process all the images in the folder. I'm hoping to get that down to hours if possible.
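For scale, here is a back-of-envelope count of the work in the loop above (plain arithmetic, nothing project-specific):

n = 16_000
pairs = n * (n - 1) // 2   # 127,992,000 unique pairs to compare
hash_ops = 2 * pairs       # 255,984,000 Image.open + dhash calls as written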

Upvotes: 1

Views: 991

Answers (1)

Jmonsky

Reputation: 1519

Try this: instead of hashing each image at comparison time (two hashes per pair, across 127,992,000 pairs), hash every file ahead of time and compare the stored hashes, since those are not going to change (16,000 hashes total).

import csv
import itertools
import os

import imagehash
from PIL import Image

os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')

duplicates = []
dup = []

# Hash every file once, up front (16,000 hashes instead of millions).
hashes = []
for file in dirloc:
    # Honestly not sure which hash method to use, so I went with dhash.
    hashes.append((file, imagehash.dhash(Image.open(file))))

for pair1, pair2 in itertools.combinations(hashes, 2):
    f1, dhash1 = pair1
    f2, dhash2 = pair2
    hashdif = dhash1 - dhash2

    if hashdif < 5:  # May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

# Setting up a CSV file with the similar images to review before deleting
# (also moved out of the loop so you aren't rewriting the file every time)
with open("duplicates.csv", "w") as myfile:
    wr = csv.writer(myfile)
    wr.writerows(zip(duplicates, dup))
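If that is still too slow, one further idea (a hypothetical sketch, not part of the code above): identical images produce identical dhash values, so grouping files by their exact hash with a dict catches those matches in a single linear pass, leaving the pairwise loop to handle only near-duplicates.

from collections import defaultdict

# Hypothetical extra pass: bucket files whose dhash values are identical.
# `hashes` is the (file, dhash) list built above.
buckets = defaultdict(list)
for file, h in hashes:
    buckets[str(h)].append(file)  # ImageHash objects stringify to hex

for h, files in buckets.items():
    if len(files) > 1:
        print("exact dhash match:", files)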

Upvotes: 1
