Stig

Reputation: 373

Check if files in dir are the same

I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same? The images were collected through web scraping and have been sequentially renamed, so I cannot compare file names.

I am currently checking whether the hashes are the same, but this is a very slow process. I am currently using:

import imagehash
from PIL import Image

def sameIm(file_name1, file_name2):
    # path is a global prefix pointing at the image folder
    hash = imagehash.average_hash(Image.open(path + file_name1))
    otherhash = imagehash.average_hash(Image.open(path + file_name2))

    return (hash == otherhash)

Then I compare every pair with nested loops. Comparing one image to the 5000+ others takes about 5 minutes, so comparing each to each would take days to compute.
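
Roughly, the nested loops look like this (a sketch; file_names is assumed to be the list of image file names in the folder):

# Sketch of the current approach; every call to sameIm re-hashes both images
for i, name1 in enumerate(file_names):
    for name2 in file_names[i + 1:]:
        if sameIm(name1, name2):
            print(name1, "and", name2, "are the same")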

Is there a faster way to do this in Python? I was thinking of parallel processing, but would that still take a long time?

Or is there another way to compare the files that is faster?

Thanks

Upvotes: 1

Views: 1648

Answers (4)

inspectorG4dget

Reputation: 113915

There is indeed a much faster way of doing this:

import collections
import glob
import os

import imagehash
from PIL import Image


def dupDetector(dirpath, ext):
    # Map each hash to the list of files that produced it
    hashes = collections.defaultdict(list)
    for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
        h = imagehash.average_hash(Image.open(fpath))
        hashes[h].append(fpath)

    for h, fpaths in hashes.items():
        if len(fpaths) == 1:
            print(fpaths[0], "is one of a kind")
            continue
        print("The following files are duplicates of each other (with the hash {}): \n\t{}".format(h, '\n\t'.join(fpaths)))

Using a dictionary with the file hash as the key gives you O(1) lookups, which means you don't need to do the pairwise comparisons. You therefore go from a quadratic runtime to a linear runtime (yay!)
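
For example, assuming the images live in a hypothetical ./images folder:

dupDetector("./images", "jpg")  # hypothetical path; call again with "png" etc.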

Upvotes: 3

dubace

Reputation: 1621

Just use a map (dictionary) structure to calculate the hash of each image, then store the hash as a key and the name of the image as a value. Since duplicate images share a hash and overwrite each other, the dictionary's values end up being the unique image names.

import imagehash
from PIL import Image

def get_hash(filename):
    return imagehash.average_hash(Image.open(path + filename))

def get_unique_images(filenames):
    # Keep one file name per hash; duplicates overwrite earlier entries
    hashes = {}
    for filename in filenames:
        image_hash = get_hash(filename)
        hashes[image_hash] = filename
    return hashes.values()
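
A hypothetical usage, with os.listdir supplying the file names:

import os

path = "./images/"  # hypothetical folder of scraped images
unique_names = list(get_unique_images(os.listdir(path)))
print(len(unique_names), "unique images found")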

Upvotes: 2

Christophe Ramet

Reputation: 666

One solution is to keep using the hash, but store it in a list of tuples (or a dict, I don't know which is more efficient here) where the first element is the name of the image and the second is the hash. Computing all the hashes once should take approximately the same 5 minutes.

If you have 5000 images, you compare the hash of the first element of the list to the 4999 others.

Then the second to the 4998 others (as you already checked it against the first one).

Then the third...

This "just" makes you do n²/2 comparisons (where n is the number of images), but each comparison is now a cheap hash comparison instead of two image hashings; see the sketch below.

Upvotes: 2

AGN Gazer

Reputation: 8378

Why not compute each hash only once?

import imagehash
from PIL import Image

# path and file_names as defined in the question
hashes = [imagehash.average_hash(Image.open(path + fn)) for fn in file_names]

def compare_hashes(hash1, hash2):
    return hash1 == hash2
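
Finding duplicate pairs then only touches the precomputed hashes (a sketch using itertools.combinations over the indices):

import itertools

# Indices map the precomputed hashes back to file_names
for i, j in itertools.combinations(range(len(hashes)), 2):
    if compare_hashes(hashes[i], hashes[j]):
        print(file_names[i], "and", file_names[j], "are the same image")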

Upvotes: 3
