aquil.abdullah

Reputation: 3157

Finding the difference between two numpy structured arrays

I have a dataset that may contain duplicates. To detect them, I put the indices into a numpy structured array, sort the array, create a second array from its unique values, and then compare the lengths of the two arrays:

import numpy as np

# tbl and t_len (the table and its row count) are defined elsewhere.
data = np.zeros(t_len, dtype={'names': ['date', 'symbol'], 'formats': ['i8', 'S16']})
data[:] = [(x['date'], x['symbol']) for x in tbl.iterrows()]
data.sort(order=['date', 'symbol'])
data2 = np.unique(data)

# np.unique drops repeated records, so a shorter result means duplicates exist.
duplicates = len(data) != len(data2)
if duplicates:
    print("There are duplicates")
else:
    print("No duplicates found")

Now, what I would really like to do is determine the indices that contain the duplicates. For example, if I had a dataset that contained:

array([12322323,'IBM'], [12322323,'IBM'], [12322323,'MSFT'], [12322323,'IBM'])

I would like to get back an array containing only the duplicated entry: array([12322323,'IBM'])

I've looked into using unique and difference functions, but those don't seem to do the job.

Upvotes: 0

Views: 1555

Answers (1)

Warren Weckesser

Reputation: 114811

For simplicity, I'll just use an array of integers, x, as the input:

>>> x = np.array([20, 10, 30, 10, 60, 30, 10])

With numpy version 1.9.0 or later, we can use np.unique to get the unique elements, passing return_counts=True so that the number of occurrences of each unique element is returned as well:

>>> u, counts = np.unique(x, return_counts=True)
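As a self-contained sketch of the return_counts route (assuming numpy 1.9.0 or later is installed), the script below prints both returned arrays so the pairing between u and counts is visible:

```python
import numpy as np

x = np.array([20, 10, 30, 10, 60, 30, 10])

# u holds the sorted unique values; counts[i] is how often u[i] occurs in x.
u, counts = np.unique(x, return_counts=True)

print(u)       # [10 20 30 60]
print(counts)  # [3 1 2 1]
```

The two arrays line up position by position: 10 occurs three times, 30 twice, and 20 and 60 once each.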

For older versions of numpy, one can use np.unique with the argument return_inverse=True to also get the array that shows how to recreate x from the array of unique elements:

>>> u, inv = np.unique(x, return_inverse=True)
>>> u
array([10, 20, 30, 60])
>>> inv
array([1, 0, 2, 0, 3, 2, 0])

Now use bincount to count the number of occurrences of each element:

>>> counts = np.bincount(inv)
>>> counts
array([3, 1, 2, 1])

So now we have counts, which tells us how many times each element occurs in the array. We can pull out the elements that have duplicates as follows:

>>> dups = u[counts > 1]
>>> dups
array([10, 30])
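The same steps should carry over to a structured array like the one in the question, since np.unique compares whole records when given a 1-D structured array. The following is a rough sketch rather than a tested solution for the original table: the sample records are made up to mirror the question's example, and the names dup_records and dup_indices are only illustrative. It also uses the inverse array to recover which positions of the input hold a duplicated record.

```python
import numpy as np

# Hypothetical sample records mirroring the example in the question.
dtype = [('date', 'i8'), ('symbol', 'S16')]
data = np.array(
    [(12322323, b'IBM'), (12322323, b'IBM'),
     (12322323, b'MSFT'), (12322323, b'IBM')],
    dtype=dtype,
)

# u: sorted unique records; inv[i] is the index in u of data[i].
u, inv = np.unique(data, return_inverse=True)
counts = np.bincount(inv)

# Records that occur more than once.
dup_records = u[counts > 1]

# Positions in the original array whose record is duplicated.
dup_indices = np.flatnonzero(counts[inv] > 1)

print(dup_records)
print(dup_indices)  # [0 1 3]
```

Indexing counts with inv maps each input row to the count of its own record, so the comparison against 1 flags every copy of a duplicated record, not just the repeats after the first.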

Upvotes: 2
