Reputation: 15777
I'm trying to detect duplicates and near-duplicates in a numpy array of images in Python. I'm using notMNIST (character image) data from this assignment/tutorial.
My approach consists of checking for an exact match of each image from one array in another array, but it is slow and I'm not sure it's working properly anyways.
The arrays have shapes (num_of_images, 28, 28).
exactOverlap = {}  # using a dict because there are train, validation, and test arrays
exactOverlap['train-val'] = 0
eoIdxs = {}
eoIdxs['train-val'] = []
ims = []  # the matched images themselves
# check for exact matches
for img in range(train_dataset.shape[0]):
    if np.equal(valid_dataset, train_dataset[img]).any(1).all():
        exactOverlap['train-val'] += 1
        ims.append(train_dataset[img])
        eoIdxs['train-val'].append(
            (img, np.where(valid_dataset == train_dataset[img])))
print(exactOverlap)
I'd like eoIdxs['train-val'] to end up as a list of tuples giving the indices in train_dataset and valid_dataset where there is a match.
This answer seems to have some clues as to how to do it, and I could see PCA and/or cv2 being useful, but I haven't gotten anything but brute-force working.
Upvotes: 2
Views: 1606
Reputation: 178
The key to speed is to avoid comparing all pairs; skipping mirrored pairs is not enough, as that is still O(n^2).
Equality
Equal images will have equal hashes (under any hash function), so only images with matching hashes can possibly be equal. Hash all images (linear time) and look for hash collisions; those are your only duplicate candidates, but you can only be sure after testing them for equality.
Similarity
However, if you want to try similarity, you must find an appropriate hash function that makes similar images collide; such hash functions are known as locality-sensitive. You can then simply check for hash collisions and assume the colliding images are similar, or apply another, probably more expensive, similarity function to compare them.
If you look at equality and similarity detection this way, they are much the same: you map candidates together through a hash function, and then compare only the candidates.
Complexity
If you think about it, the problem may still be O(n^2), as we are using the same all-pairs comparison within each bucket. To get linear complexity, it is essential that the hash function have enough classes that every bucket holds no more than a constant number of images (this requires that images are not duplicated too many times in each dataset :P).
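The hash-then-verify idea above can be sketched like this (the function name and toy arrays are illustrative, not from the question):

```python
import hashlib
from collections import defaultdict

import numpy as np

def exact_duplicates(a, b):
    """Return (i, j) pairs where a[i] and b[j] are identical images.

    Buckets the images of b by a hash of their raw bytes, so each image
    of a is compared only against the handful of candidates in its bucket.
    """
    buckets = defaultdict(list)
    for j, img in enumerate(b):
        buckets[hashlib.sha1(img.tobytes()).hexdigest()].append(j)
    pairs = []
    for i, img in enumerate(a):
        key = hashlib.sha1(img.tobytes()).hexdigest()
        for j in buckets.get(key, []):
            if np.array_equal(img, b[j]):  # confirm: hashes could collide
                pairs.append((i, j))
    return pairs

# toy data: a[0] and a[2] are duplicates of b[1], a[1] of b[0]
a = np.stack([np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2))])
b = np.stack([np.ones((2, 2)), np.zeros((2, 2))])
print(exact_duplicates(a, b))  # [(0, 1), (1, 0), (2, 1)]
```

This stays near-linear as long as no image repeats too many times, per the complexity note above.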
Upvotes: 3
Reputation: 21
Regarding the part of your question
I'm not sure it's working properly anyways
I believe you have a bug in the line where you check for an exact match:
if np.equal(valid_dataset, train_dataset[img]).any(1).all()
For illustration purposes, let's consider a toy example where all images are 5x5 and there are only 3 images in the valid_dataset. Let's go through possible steps of your check when the tested image is contained in the valid_dataset, e.g. let's check the 2nd image from the valid_dataset. In this case np.equal(valid_dataset, train_dataset[img]) might give us, e.g.,
[[[ True False True False False]
[False False True False False]
[False False True False False]
[False False False True False]
[False False False False False]]
[[ True True True True True]
[ True True True True True]
[ True True True True True]
[ True True True True True]
[ True True True True True]]
[[ True True True True False]
[ True True True False True]
[ True True False True False]
[ True False True False False]
[False True False False True]]]
Next, you apply .any(1) to this 3D result. This operation projects along the 2nd dimension and tells us whether at least one value along it is True. Hence, the result of applying .any(1) has shape 3x5; e.g., the value at position [0,0] is a logical OR of the following values from our example
[[[ True ..... ..... ..... .....]
[False ..... ..... ..... .....]
[False ..... ..... ..... .....]
[False ..... ..... ..... .....]
[False ..... ..... ..... .....]]
[...]
[...]]
Thus, in the case of our toy example the result of applying .any(1) would be
[[ True False True True False]
[ True True True True True]
[ True True True True True]]
Applying .all() to that results in False, even though the image is contained in the valid_dataset.
Correct solution:
What you want to check is that all pixels of the tested image are the same as the pixels of at least one image in the valid_dataset. In other words, for a match (i.e. for the condition to be True) you require that, for at least one valid_dataset image, all values projected onto the plane spanned by the 2nd and 3rd dimensions equal the tested image's pixels. Therefore, the condition should be
if np.equal(valid_dataset, train_dataset[img]).all(axis=(1, 2)).any()
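A quick runnable comparison of the two conditions (the toy arrays below are illustrative, not the notMNIST data):

```python
import numpy as np

# Three 5x5 "images"; the 2nd one is the image we will search for.
valid_dataset = np.stack([
    np.zeros((5, 5), dtype=int),     # shares some pixel values with the target
    np.arange(25).reshape(5, 5),     # exact copy of the tested image
    np.full((5, 5), 99, dtype=int),  # shares no pixel values with the target
])
test_img = np.arange(25).reshape(5, 5)

# Buggy check from the question: collapses the wrong axis first.
buggy = np.equal(valid_dataset, test_img).any(1).all()
# Corrected check: require a full 5x5 match with at least one image.
fixed = np.equal(valid_dataset, test_img).all(axis=(1, 2)).any()
print(buggy, fixed)  # False True
```

The buggy version misses the exact copy because the all-99 image contributes an all-False row after .any(1), while the corrected version finds it.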
Upvotes: 2
Reputation: 10759
The numpy_indexed package (disclaimer: I am its author) will allow you to find exact matches trivially and efficiently.
import numpy_indexed as npi
unique = npi.unique(train_dataset)               # one copy of each distinct image
duplicate = npi.multiplicity(train_dataset) > 1  # True for images that repeat
It does not offer any help with inexact matches though; that's fundamentally quite a different problem.
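As an aside, the same exact-match bookkeeping is possible in plain NumPy (1.13+), which accepts an axis argument to np.unique; a minimal sketch with toy data:

```python
import numpy as np

# Toy stack of 2x2 "images"; the 1st and 3rd are duplicates.
train_dataset = np.stack([np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2))])

# Treat axis 0 as the sequence of images and deduplicate along it.
uniq, inverse, counts = np.unique(
    train_dataset, axis=0, return_inverse=True, return_counts=True)
duplicate = counts[inverse] > 1  # True for every image that appears more than once
print(uniq.shape, duplicate)    # (2, 2, 2) [ True False  True]
```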
Upvotes: 1