Reputation: 323
I'm currently working on the notMNIST dataset using Python 2.7 and trying to remove duplicate images. I computed the MD5 hash of each image and stored the results in a dictionary, image_hash.
The first method works, but it took almost an hour because there are 500000 images altogether in the dataset:
image_hash_identical = {}
for key, value in image_hash.items():
    if value not in image_hash_identical.values():
        image_hash_identical[key] = value
I then tried a second method that uses set() to make things faster:
image_hash_set_values = list(set(image_hash.values()))
for i in range(len(image_hash_set_values)):
    for j in range(i, len(image_hash)):
        if image_hash[j] == image_hash_set_values[i]:
            image_hash_identical[i] = image_hash[j]
            break
However, this code did not speed things up, because set() does not preserve the order of the values in image_hash. Is there a way to stop set() from reordering them, or is there a faster algorithm for this situation?
Upvotes: 0
Views: 77
Reputation: 73470
Why not just keep track of seen values using a set:
image_hash_identical, seen = {}, set()
for key, value in image_hash.items():
    if value not in seen:  # set membership test: O(1) on average
        image_hash_identical[key] = value
        seen.add(value)
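For anyone who wants to try this on a small input first, here is a minimal self-contained sketch of the same idea. The function name dedupe_by_value and the toy file names and hash strings are made up for illustration; they are not from the question. The speedup comes from the lookup: `value not in image_hash_identical.values()` scans a list on every iteration, which makes the first method quadratic, while a set lookup is O(1) on average, so the whole loop is linear.

```python
def dedupe_by_value(mapping):
    """Return a dict that keeps exactly one key for each distinct value."""
    unique, seen = {}, set()
    for key, value in mapping.items():
        if value not in seen:  # average O(1) lookup in the set
            unique[key] = value
            seen.add(value)
    return unique


# Hypothetical toy data: file names mapped to placeholder hash strings
image_hash = {
    "a.png": "h1",
    "b.png": "h2",
    "c.png": "h1",  # duplicate of a.png
    "d.png": "h2",  # duplicate of b.png
    "e.png": "h3",
}
image_hash_identical = dedupe_by_value(image_hash)
```

Note that in Python 2.7 a plain dict iterates in arbitrary order, so which of several duplicate keys is kept is not guaranteed; only the number of survivors (one per distinct hash) is.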
Upvotes: 1