Reputation: 373
I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same. The images were collected through web scraping and have been sequentially renamed so I cannot compare file names.
I am currently checking if the hashes are the same however this is a very long process. I am currently using:
def sameIm(file_name1,file_name2):
hash = imagehash.average_hash(Image.open(path + file_name1))
otherhash = imagehash.average_hash(Image.open(path + file_name2))
return (hash == otherhash)
Then nested loops. Comparing 1 image to 5000+ others takes about 5mins so comparing each to each would take days to compute.
Is there a faster way to do this in python. I was thinking parallel processing but would that still take a long time?
or is there another way to compare files which is faster?
Thanks
Upvotes: 1
Views: 1648
Reputation: 113915
There is indeed a much faster way of doing this:
import collections
import glob
import os
def dupDetector(dirpath, ext):
hashes = collections.defaultdict(list)
for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
h = imagehash.average_hash(Image.open(fpath))
hashes[h].append(fpath)
for h,fpaths in hashes.items():
if len(fpaths) == 1:
print(fpaths[0], "is one of a kind")
continue
print("The following files are duplicates of each other (with the hash {}): \n\t{}".format(h, '\n\t'.join(fpaths)))
Using the dictionary with the file hash as a key gives you O(1) lookups, which means you don't need to do the pair-wise comparisons. You therefore go from a quadratic runtime, to a linear runtime (yay!)
Upvotes: 3
Reputation: 1621
Just use map structure to calculate hashes for each image,then store hashes as a key and name of the image as a value. As a result you would have unique images names array.
def get_hash(filename):
return imagehash.average_hash(Image.open(path + filename))
def get_unique_images(filenames):
hashes = {}
for filename in filenames:
image_hash = get_hash(filename)
hashes[image_hash] = filename
return hashes.values()
Upvotes: 2
Reputation: 666
One solution is to keep using the hash but stocking it in a list of tuple (or a dic, i don't know wich is more efficient here) where the first element is the name of the image and the second is the hash. It should take aproximatively the same 5 mins.
If you have 5000 images, You compare the value of the first element of the list to the 4999 others
Then the second to the 4998 others (as you already checked the first one)
Then the third ...
This "just" make you do n²/2 comparisons (where n is the number of images)
Upvotes: 2
Reputation: 8378
Why not compute hash only once?
hashes = [imagehash.average_hash(Image.open(path + fn)) for fn in file_names]
def compare_hashes(hash1, hash2):
return hash1 == hash2
Upvotes: 3