Xiuyuanc

Reputation: 323

Faster algorithm for removing duplicates from dictionaries: comparison of two methods

I'm currently working on the notMNIST dataset using Python 2.7, trying to remove duplicated images. I turned each image into an MD5 hash and created a dictionary, image_hash.

The first method works, but it took almost an hour because there are 500,000 images in the dataset altogether.

image_hash_identical = {}
for key,value in image_hash.items():
    if value not in image_hash_identical.values():
        image_hash_identical[key] = value
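
To see why this is slow: `value not in image_hash_identical.values()` scans every value stored so far on each iteration, so the whole loop does roughly n*n/2 comparisons. A minimal sketch (with hypothetical toy hashes, not real MD5 digests) that counts the values scanned by the membership test:

```python
# Hypothetical toy data: 6 images sharing 3 distinct hashes.
image_hash = {"img_%d" % i: "hash_%d" % (i % 3) for i in range(6)}

image_hash_identical = {}
comparisons = 0
for key, value in image_hash.items():
    vals = list(image_hash_identical.values())
    comparisons += len(vals)  # values scanned (worst case) by `not in`
    if value not in vals:
        image_hash_identical[key] = value

print(len(image_hash_identical), comparisons)  # 3 unique hashes kept
```

With only 6 items the scan count is small, but it grows quadratically: at 500,000 items this is on the order of 10^11 comparisons, which matches the hour-long runtime.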

I tried to use set() to create a second, hopefully faster method:

image_hash_items = list(image_hash.items())  # fix an iteration order
image_hash_set_values = list(set(image_hash.values()))
for i in range(len(image_hash_set_values)):
    for j in range(len(image_hash_items)):
        if image_hash_items[j][1] == image_hash_set_values[i]:
            image_hash_identical[image_hash_items[j][0]] = image_hash_items[j][1]
            break

However, this code failed to accelerate the process, because set() shuffles the order of the hashes. Is there any way to preserve the order when using set(), or a faster algorithm that can handle this situation?

Upvotes: 0

Views: 77

Answers (1)

user2390182
user2390182

Reputation: 73470

Why not just keep track of the values you have already seen, using a set:

image_hash_identical, seen = {}, set()
for key, value in image_hash.items():
    if value not in seen:  # set membership test: O(1) on average
        image_hash_identical[key] = value
        seen.add(value)
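
A small usage sketch of this approach, with hypothetical truncated hashes standing in for real MD5 digests:

```python
# Hypothetical toy data: img_c.png is a duplicate of img_a.png.
image_hash = {
    "img_a.png": "d41d8cd9",
    "img_b.png": "9e107d9d",
    "img_c.png": "d41d8cd9",
}

image_hash_identical, seen = {}, set()
for key, value in image_hash.items():
    if value not in seen:  # O(1) average-case membership test
        image_hash_identical[key] = value
        seen.add(value)

print(image_hash_identical)  # one key kept per distinct hash
```

Since each item is visited once and every set operation is constant time on average, the whole pass is O(n), versus the O(n^2) of scanning `.values()` on each iteration.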

Upvotes: 1
