Reputation: 15777
I'm trying to detect duplicates and near-duplicates in a numpy array of images in Python. I'm using notMNIST (character image) data from this assignment/tutorial.
My approach consists of checking for an exact match of each image from one array in another array, but it is slow and I'm not sure it's working properly anyways.
The arrays have shapes (num_of_images, 28, 28).
exactOverlap = {}  # using a dict because there are train, validation, and test arrays
exactOverlap['train-val'] = 0
eoIdxs = {}
eoIdxs['train-val'] = []
ims = []  # the matched images themselves
# check for exact matches
for img in range(train_dataset.shape[0]):
    if np.equal(valid_dataset, train_dataset[img]).any(1).all():
        exactOverlap['train-val'] += 1
        ims.append(train_dataset[img])
        eoIdxs['train-val'].append(
            (img, np.where(valid_dataset == train_dataset[img])))
print(exactOverlap)
I'd like eoIdxs['train-val'] to end up as a list of tuples giving the indices in train_dataset and valid_dataset where there is a match.
This answer seems to have some clues as to how to do it, and I could see PCA and/or cv2 being useful, but I haven't gotten anything but brute-force working.
Upvotes: 2
Views: 1606
Reputation: 178
The key to speed is to avoid comparing all pairs; skipping mirrored pairs is not enough, as that is still O(n^2).
Equality
Equal images will have equal hashes (under any hash function), so only images with matching hashes can possibly be equal. Hash all images (linear time) and look for hash collisions; those are your only duplicate candidates, but you can only be sure after testing them for equality.
Similarity
However, if you want to try similarity, you must find an appropriate hash function that makes similar images collide; such hash functions are known as locality-sensitive. You can then simply check for hash collisions and assume the colliding images are similar, or apply another, probably more expensive, similarity function to compare them.
If you look at equality and similarity detection this way, they are much the same: you map candidates together through a hash function, and then compare only the candidates.
Complexity
If you think about it, the problem may still be O(n^2), as we are using the same all-pairs comparison within each bucket. To get linear complexity, it is essential that the hash function have enough classes that every bucket holds no more than a constant number of images (this requires that images are not duplicated too many times in each dataset :P).
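The hash-then-verify idea above can be sketched like this (the function name and toy arrays are illustrative, not from the question):

```python
import hashlib
from collections import defaultdict

import numpy as np

def exact_duplicates(a, b):
    """Return (i, j) pairs where a[i] and b[j] are identical images.

    Buckets the images of b by a hash of their raw bytes, so each image
    of a is compared only against the handful of candidates in its bucket.
    """
    buckets = defaultdict(list)
    for j, img in enumerate(b):
        buckets[hashlib.sha1(img.tobytes()).hexdigest()].append(j)
    pairs = []
    for i, img in enumerate(a):
        key = hashlib.sha1(img.tobytes()).hexdigest()
        for j in buckets.get(key, []):
            if np.array_equal(img, b[j]):  # confirm: hashes could collide
                pairs.append((i, j))
    return pairs

# toy data: a[0] and a[2] are duplicates of b[1], a[1] of b[0]
a = np.stack([np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2))])
b = np.stack([np.ones((2, 2)), np.zeros((2, 2))])
print(exact_duplicates(a, b))  # [(0, 1), (1, 0), (2, 1)]
```

This stays near-linear as long as no image repeats too many times, per the complexity note above.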
Upvotes: 3
Reputation: 21
Regarding the part of your question
I'm not sure it's working properly anyways
I believe you have a bug in the line where you check for an exact match:
if np.equal(valid_dataset, train_dataset[img]).any(1).all()
For illustration purposes, let's consider a toy example where all images are 5x5 and there are only 3 images in the valid_dataset. Let's go through possible steps of your check when the tested image is contained in the valid_dataset, e.g. let's check the 2nd image from the valid_dataset. In this case np.equal(valid_dataset, train_dataset[img]) might give us, e.g.,
[[[ True False True False False]
[False False True False False]
[False False True False False]
[False False False True False]
[False False False False False]]
[[ True True True True True]
[ True True True True True]
[ True True True True True]
[ True True True True True]
[ True True True True True]]
[[ True True True True False]
[ True True True False True]
[ True True False True False]
[ True False True False False]
[False True False False True]]]
Next, you apply .any(1) to this 3D result. This operation projects along the 2nd dimension and tells us whether at least one value along it is True. Hence, the result of applying .any(1) has shape 3x5; e.g., the value at position [0,0] is a logical OR of the following values from our example
[[[ True ..... ..... ..... .....]
[False ..... ..... ..... .....]
[False ..... ..... ..... .....]
[False ..... ..... ..... .....]
[False ..... ..... ..... .....]]
[...]
[...]]
Thus, in the case of our toy example the result of applying .any(1) would be
[[ True False True True False]
[ True True True True True]
[ True True True True True]]
Applying .all() to that results in False, even though the image is contained in the valid_dataset.
Correct solution:
What you want to check is that all pixels of the tested image are the same as the pixels of at least one image in the valid_dataset. In other words, for a match (i.e. for the condition to be True) you require that, for at least one valid_dataset image, all values projected onto the plane spanned by the 2nd and 3rd dimensions equal the tested image's pixels. Therefore, the condition should be
if np.equal(valid_dataset, train_dataset[img]).all(axis=(1, 2)).any()
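A quick runnable comparison of the two conditions (the toy arrays below are illustrative, not the notMNIST data):

```python
import numpy as np

# Three 5x5 "images"; the 2nd one is the image we will search for.
valid_dataset = np.stack([
    np.zeros((5, 5), dtype=int),     # shares some pixel values with the target
    np.arange(25).reshape(5, 5),     # exact copy of the tested image
    np.full((5, 5), 99, dtype=int),  # shares no pixel values with the target
])
test_img = np.arange(25).reshape(5, 5)

# Buggy check from the question: collapses the wrong axis first.
buggy = np.equal(valid_dataset, test_img).any(1).all()
# Corrected check: require a full 5x5 match with at least one image.
fixed = np.equal(valid_dataset, test_img).all(axis=(1, 2)).any()
print(buggy, fixed)  # False True
```

The buggy version misses the exact copy because the all-99 image contributes an all-False row after .any(1), while the corrected version finds it.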
Upvotes: 2
Reputation: 10759
The numpy_indexed package (disclaimer: I am its author) will allow you to find exact matches trivially and efficiently.
import numpy_indexed as npi
unique = npi.unique(train_dataset)               # one copy of each distinct image
duplicate = npi.multiplicity(train_dataset) > 1  # True for images that repeat
It does not offer any help with inexact matches though; that's fundamentally quite a different problem.
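As an aside, the same exact-match bookkeeping is possible in plain NumPy (1.13+), which accepts an axis argument to np.unique; a minimal sketch with toy data:

```python
import numpy as np

# Toy stack of 2x2 "images"; the 1st and 3rd are duplicates.
train_dataset = np.stack([np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2))])

# Treat axis 0 as the sequence of images and deduplicate along it.
uniq, inverse, counts = np.unique(
    train_dataset, axis=0, return_inverse=True, return_counts=True)
duplicate = counts[inverse] > 1  # True for every image that appears more than once
print(uniq.shape, duplicate)    # (2, 2, 2) [ True False  True]
```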
Upvotes: 1