Reputation: 3157
I have a data set that may contain duplicates. In order to find the duplicates in the dataset I put the indices into a numpy structured array, sort the array, create another array from the unique values and then compare the lengths of the two arrays:
data = np.zeros(t_len, dtype={'names':['date', 'symbol'], 'formats':['i8', 'S16']})
data[:] = [(x['date'], x['symbol']) for x in tbl.iterrows()]
data.sort(order=['date', 'symbol'])
data2 = np.unique(data)
duplicates = False
if len(data) != len(data2):
    duplicates = True
    print "There are duplicates"
if not duplicates:
    print "No duplicates found"
Now, what I would really like to do is determine the indices that contain the duplicates. For example, if I had a dataset that contained:
array([[12322323, 'IBM'], [12322323, 'IBM'], [12322323, 'MSFT'], [12322323, 'IBM']])
I would like to see an array containing array([12322323, 'IBM'])
I've looked into using unique and difference functions, but those don't seem to do the job.
Upvotes: 0
Views: 1555
Reputation: 114811
For simplicity, I'll just use an array of integers, x, as the input:
>>> x = np.array([20, 10, 30, 10, 60, 30, 10])
With numpy version 1.9.0 or later, we can use np.unique to get the unique elements, passing the argument return_counts=True so that the number of occurrences of each unique element is also returned:
>>> u, counts = np.unique(x, return_counts=True)
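For the example x above, this yields the sorted unique values alongside their occurrence counts:

```python
import numpy as np

x = np.array([20, 10, 30, 10, 60, 30, 10])

# One call gives both the unique values and how often each occurs
u, counts = np.unique(x, return_counts=True)
print(u)       # [10 20 30 60]
print(counts)  # [3 1 2 1]
```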
For older versions of numpy, one can use np.unique with the argument return_inverse=True to also get the array that shows how to recreate x from the array of unique elements:
>>> u, inv = np.unique(x, return_inverse=True)
>>> u
array([10, 20, 30, 60])
>>> inv
array([1, 0, 2, 0, 3, 2, 0])
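As a sanity check, indexing u with inv reconstructs the original array, which is exactly what "recreate x" means here:

```python
import numpy as np

x = np.array([20, 10, 30, 10, 60, 30, 10])
u, inv = np.unique(x, return_inverse=True)

# Each entry of inv is the position in u of the corresponding
# element of x, so fancy-indexing u with inv reproduces x exactly
print(u[inv])  # [20 10 30 10 60 30 10]
```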
Now use bincount to count the number of occurrences of each element:
>>> counts = np.bincount(inv)
>>> counts
array([3, 1, 2, 1])
So now we have counts, which tells us how many times each element occurs in the array. We can pull out the elements that have duplicates as follows:
>>> dups = u[counts > 1]
>>> dups
array([10, 30])
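Applied to a structured (date, symbol) array like the one in the question, the same pattern works unchanged; the data and dtype below are assumptions mirroring the question's setup. Since the question also asks for the indices of the duplicated entries, counts[inv] gives a per-row count that can be thresholded to recover them:

```python
import numpy as np

# Hypothetical data mirroring the question's (date, symbol) layout
data = np.array([(12322323, b'IBM'), (12322323, b'IBM'),
                 (12322323, b'MSFT'), (12322323, b'IBM')],
                dtype=[('date', 'i8'), ('symbol', 'S16')])

u, inv, counts = np.unique(data, return_inverse=True, return_counts=True)

# The (date, symbol) pairs that occur more than once
dups = u[counts > 1]

# counts[inv] maps each row of data to the count of its value,
# so rows where that count exceeds 1 are the duplicated indices
dup_rows = np.flatnonzero(counts[inv] > 1)

print(dups)      # one entry: (12322323, b'IBM')
print(dup_rows)  # [0 1 3]
```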
Upvotes: 2