Huanian Zhang

Reputation: 860

remove duplicate elements from two numpy arrays

I have two numpy arrays a and b, each with twenty million elements (floats). If the pair of elements (a[i], b[i]) appears at more than one index, we call the later occurrences duplicates, and they should be removed from both arrays. For instance,

a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])

In those two arrays, the pair a[2]&b[2] is the same as a[11]&b[11], so it is a duplicate and should be removed; the same goes for a[1]&b[1] vs a[3]&b[3]. Although each array contains repeated values on its own, those are not treated as duplicates. So I want the returned arrays to be:

a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])

Does anyone have a clever way to implement such a reduction?

Upvotes: 2

Views: 5055

Answers (2)

B. M.

Reputation: 18628

First you have to pack a and b so each pair becomes a single value, in order to identify duplicates. If the values are non-negative integers (see the edit below for other cases), this can be achieved by:

base=a.max()+1      # every value of a is strictly less than base
c=a+base*b          # so distinct (a, b) pairs map to distinct codes

Then just find unique values in c:

val,ind=np.unique(c,return_index=True)

and retrieve the associated values in a and b.

ind.sort()
print(a[ind])
print(b[ind])

Note the disappearance of the duplicates (two here):

[ 1  3  6  7  8  3  2  9 10 14]
[ 2  4 15  7  9  2  2  0 11  4]
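Putting the steps above together, a self-contained sketch of this approach (assuming non-negative integers, as the answer states; the names `a2`/`b2` are just for illustration):

```python
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

# Pack each (a[i], b[i]) pair into a single integer code; since every value
# of a is strictly less than base, distinct pairs get distinct codes.
base = a.max() + 1
c = a + base * b

# np.unique returns the index of the first occurrence of each code.
val, ind = np.unique(c, return_index=True)
ind.sort()  # restore the original ordering of the surviving pairs

a2, b2 = a[ind], b[ind]
print(a2)  # [ 1  3  6  7  8  3  2  9 10 14]
print(b2)  # [ 2  4 15  7  9  2  2  0 11  4]
```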

EDIT

Regardless of the datatype, the c array can be made as follows, packing the data to bytes:

ab=np.ascontiguousarray(np.vstack((a,b)).T)  # one (a[i], b[i]) pair per row
dtype='S'+str(2*a.itemsize)                  # one fixed-width byte string per row
c=ab.view(dtype=dtype)
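A runnable sketch of this byte-packing variant on the question's data (reusing `np.unique` with `return_index` exactly as in the integer case):

```python
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

# Stack the arrays as rows of pairs, contiguous in memory, so each row's
# raw bytes can be reinterpreted as one fixed-width byte string.
ab = np.ascontiguousarray(np.vstack((a, b)).T)
c = ab.view(dtype='S' + str(2 * a.itemsize))

# Unique byte strings correspond to unique (a, b) pairs.
val, ind = np.unique(c, return_index=True)
ind.sort()  # keep the surviving pairs in their original order
print(a[ind])
print(b[ind])
```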

Upvotes: 3

Reti43

Reputation: 9796

This is done in one pass and without requiring any extra memory for the resulting arrays.

Pair up the elements at each index and iterate over them. Keep track of which pairs have been seen so far, along with a write index into the arrays. When a pair has not been seen before, the write index increases by 1, effectively writing the pair back to its original place. For a duplicate pair, however, the index does not increase, effectively shifting every subsequent new pair one position to the left. At the end, keep only the first index elements to shorten the arrays.

import itertools as it

try:
    izip = it.izip  # Python 2: lazy pairing
except AttributeError:
    izip = zip      # Python 3: zip is already lazy

def delete_duplicate_pairs(*arrays):
    unique = set()
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0
    for pair in izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]  # write the pair back in place
            index += 1
    return [a[:index] for a in arrays]

If you are on Python 2, zip() creates the list of pairs up front. If your arrays hold a lot of elements, it is more efficient to use itertools.izip(), which creates the pairs as you request them. In Python 3, zip() behaves like that by default.

For your case,

>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1,  3,  6,  7,  8,  3,  2,  9, 10, 14])
>>> b
array([ 2,  4, 15,  7,  9,  2,  2,  0, 11,  4])
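As an aside (not part of this answer's approach): if your NumPy is recent enough to support the `axis` parameter of `np.unique` (1.13+), the pair deduplication can also be done in one vectorized call, a sketch:

```python
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

pairs = np.stack((a, b), axis=1)  # shape (n, 2), one row per pair
# return_index gives the first occurrence of each unique row.
_, ind = np.unique(pairs, axis=0, return_index=True)
ind.sort()  # keep first occurrences in their original order
a2, b2 = a[ind], b[ind]
```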

Now, it all comes down to what values your arrays hold. If they contain only the values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves time. With 20 million elements in both a and b and values only between 0-9, the process completes in 6 seconds; for values between 0-999, it takes 12 seconds.

Upvotes: 2
