Huanian Zhang

Reputation: 860

remove duplicate elements from two numpy arrays

I have two numpy arrays a and b, each with twenty million elements (floats). If the pair of elements (a[i], b[i]) appears at more than one index, we call the later occurrences duplicates, and they should be removed from both arrays. For instance,

a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])

In those two arrays, the pair a[2]&b[2] is the same as a[11]&b[11], so it is a duplicate and should be removed; the same goes for a[1]&b[1] vs a[3]&b[3]. Although each array contains repeated values on its own, those are not treated as duplicates. So I want the returned arrays to be:

a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])

Does anyone have a clever way to implement such a reduction?

Upvotes: 2

Views: 5055

Answers (2)

B. M.

Reputation: 18628

First you have to pack a and b so each pair becomes a single value, in order to identify duplicates. If the values are non-negative integers (see the edit below for other cases), this can be achieved by:

base=a.max()+1      # every value of a is strictly less than base
c=a+base*b          # so distinct (a, b) pairs map to distinct codes

Then just find unique values in c:

val,ind=np.unique(c,return_index=True)

and retrieve the associated values in a and b.

ind.sort()
print(a[ind])
print(b[ind])

Note the disappearance of the duplicates (two here):

[ 1  3  6  7  8  3  2  9 10 14]
[ 2  4 15  7  9  2  2  0 11  4]
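Putting the steps above together, a self-contained sketch of this approach (assuming non-negative integers, as the answer states; the names `a2`/`b2` are just for illustration):

```python
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

# Pack each (a[i], b[i]) pair into a single integer code; since every value
# of a is strictly less than base, distinct pairs get distinct codes.
base = a.max() + 1
c = a + base * b

# np.unique returns the index of the first occurrence of each code.
val, ind = np.unique(c, return_index=True)
ind.sort()  # restore the original ordering of the surviving pairs

a2, b2 = a[ind], b[ind]
print(a2)  # [ 1  3  6  7  8  3  2  9 10 14]
print(b2)  # [ 2  4 15  7  9  2  2  0 11  4]
```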

EDIT

Regardless of the datatype, the c array can be made as follows, packing the data to bytes:

ab=np.ascontiguousarray(np.vstack((a,b)).T)  # one (a[i], b[i]) pair per row
dtype='S'+str(2*a.itemsize)                  # one fixed-width byte string per row
c=ab.view(dtype=dtype)
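A runnable sketch of this byte-packing variant on the question's data (reusing `np.unique` with `return_index` exactly as in the integer case):

```python
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

# Stack the arrays as rows of pairs, contiguous in memory, so each row's
# raw bytes can be reinterpreted as one fixed-width byte string.
ab = np.ascontiguousarray(np.vstack((a, b)).T)
c = ab.view(dtype='S' + str(2 * a.itemsize))

# Unique byte strings correspond to unique (a, b) pairs.
val, ind = np.unique(c, return_index=True)
ind.sort()  # keep the surviving pairs in their original order
print(a[ind])
print(b[ind])
```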

Upvotes: 3

Reti43

Reputation: 9796

This is done in one pass and without requiring any extra memory for the resulting arrays.

Pair up the elements at each index and iterate over them. Keep track of which pairs have been seen so far, along with a write index into the arrays. When a pair has not been seen before, the write index increases by 1, effectively writing the pair back to its original place. For a duplicate pair, however, the index does not increase, effectively shifting every subsequent new pair one position to the left. At the end, keep only the first index elements to shorten the arrays.

import itertools as it

try:
    izip = it.izip  # Python 2: lazy pairing
except AttributeError:
    izip = zip      # Python 3: zip is already lazy

def delete_duplicate_pairs(*arrays):
    unique = set()
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0
    for pair in izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]  # write the pair back in place
            index += 1
    return [a[:index] for a in arrays]

If you are on Python 2, zip() creates the list of pairs up front. If your arrays hold a lot of elements, it is more efficient to use itertools.izip(), which creates the pairs as you request them. In Python 3, zip() behaves like that by default.

For your case,

>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1,  3,  6,  7,  8,  3,  2,  9, 10, 14])
>>> b
array([ 2,  4, 15,  7,  9,  2,  2,  0, 11,  4])
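As an aside (not part of this answer's approach): if your NumPy is recent enough to support the `axis` parameter of `np.unique` (1.13+), the pair deduplication can also be done in one vectorized call, a sketch:

```python
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

pairs = np.stack((a, b), axis=1)  # shape (n, 2), one row per pair
# return_index gives the first occurrence of each unique row.
_, ind = np.unique(pairs, axis=0, return_index=True)
ind.sort()  # keep first occurrences in their original order
a2, b2 = a[ind], b[ind]
```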

Now, it all comes down to what values your arrays hold. If they contain only the values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves time. With 20 million elements in both a and b and values only between 0-9, the process completes in 6 seconds; for values between 0-999, it takes 12 seconds.

Upvotes: 2
