Reputation: 860
I have two numpy arrays a and b, each holding twenty million elements (floats). If the pair of elements at some index in the two arrays is the same as the pair at an earlier index, we call it a duplicate, and it should be removed from both arrays. For instance,
a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])
In those two arrays, the pair a[2]&b[2] is the same as the pair a[11]&b[11], so we call it a duplicate and it should be removed. The same holds for a[1]&b[1] vs a[3]&b[3].
Although each array has duplicate elements within itself, those alone are not treated as duplicates; only matching pairs are. So I want the returned arrays to be:
a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])
Does anyone have a clever way to implement such a reduction?
Upvotes: 2
Views: 5055
Reputation: 18628
First you have to pack a and b together to identify the duplicate pairs.
If the values are non-negative integers (see the edit below for other cases), this can be achieved by:
base = a.max() + 1   # strictly greater than any value in a
c = a + base * b     # encodes each (a[i], b[i]) pair as one integer
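As a quick check on the arrays from the question (base is 15 there), each pair maps to a single code, and the duplicate pairs collide exactly where expected:
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])
c = a + (a.max() + 1) * b
print(c)
# [ 31  63 231  63 112 143  33  32   9 175  74 231]
# c[1] == c[3] and c[2] == c[11], matching the duplicate pairs in the question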
Then just find the unique values in c:
val,ind=np.unique(c,return_index=True)
and retrieve the associated values in a and b:
ind.sort()
print(a[ind])
print(b[ind])
The duplicates (two here) have disappeared:
[ 1 3 6 7 8 3 2 9 10 14]
[ 2 4 15 7 9 2 2 0 11 4]
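Putting the three steps together, a minimal self-contained sketch (the function name is mine, and it keeps the non-negative-integer assumption):
import numpy as np

def remove_duplicate_pairs(a, b):
    # assumes a and b contain non-negative integers
    base = a.max() + 1
    c = a + base * b                          # one integer code per pair
    _, ind = np.unique(c, return_index=True)  # first occurrence of each code
    ind.sort()                                # restore the original ordering
    return a[ind], b[ind]

a, b = remove_duplicate_pairs(a, b)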
EDIT
Regardless of the datatype, the c array can be built as follows, packing each pair of values into raw bytes:
ab = np.ascontiguousarray(np.vstack((a, b)).T)  # shape (n, 2), rows contiguous
dtype = 'S' + str(2 * a.itemsize)               # one fixed-width bytes value per row
c = ab.view(dtype=dtype)
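An end-to-end run of this variant on the question's data (a sketch assuming a and b share the same dtype, which the 2*a.itemsize term requires):
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

ab = np.ascontiguousarray(np.vstack((a, b)).T)  # shape (n, 2), rows contiguous
c = ab.view(dtype='S' + str(2 * a.itemsize))    # view each row as one bytes value
_, ind = np.unique(c, return_index=True)
ind.sort()
print(a[ind])   # [ 1  3  6  7  8  3  2  9 10 14]
print(b[ind])   # [ 2  4 15  7  9  2  2  0 11  4]
Note that this compares raw bytes, so with floats, 0.0 and -0.0 would count as different pairs even though they compare equal numerically.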
Upvotes: 3
Reputation: 9796
This is done in one pass and without requiring any extra memory for the resulting arrays.
Pair up the elements at each index and iterate over the pairs. Keep track of which pairs have been seen so far, along with a write index into the arrays. When a pair has not been seen before, write it back at the write index and increase the index by 1, effectively keeping it in its original place. For a duplicate pair, don't increase the index, effectively shifting every later new pair one position to the left. At the end, keep only the first index elements to shorten the arrays.
import itertools as it

def delete_duplicate_pairs(*arrays):
    unique = set()            # pairs seen so far
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0                 # next write position
    for pair in it.izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]   # write the pair back in place
            index += 1
    return [a[:index] for a in arrays]
If you are on Python 2, zip() creates the whole list of pairs up front. If you have a lot of elements in your arrays, it is more efficient to use itertools.izip(), which creates the pairs as you request them. However, zip() in Python 3 behaves like that by default.
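If you need the function above on Python 3, where itertools.izip no longer exists, one common pattern is an alias (then use izip(*arrays) instead of it.izip(*arrays) in the loop):
try:
    from itertools import izip   # Python 2: lazy pair iterator
except ImportError:
    izip = zip                   # Python 3: the built-in zip is already lazy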
For your case,
>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1, 3, 6, 7, 8, 3, 2, 9, 10, 14])
>>> b
array([ 2, 4, 15, 7, 9, 2, 2, 0, 11, 4])
Now, it all comes down to what values your arrays hold. If you have only the values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves time. For 20 million elements in both a and b with values only between 0-9, the process completes in 6 seconds. For values between 0-999, it takes 12 seconds.
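Those timings are machine-dependent; to reproduce them, a minimal harness like this one works (array size and value range mirror the numbers above, and delete_duplicate_pairs is the function defined earlier):
import time
import numpy as np

n = 20 * 10**6
a = np.random.randint(0, 10, n)   # values 0-9; use 1000 for the 0-999 case
b = np.random.randint(0, 10, n)

start = time.time()
a, b = delete_duplicate_pairs(a, b)
print(time.time() - start, 'seconds')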
Upvotes: 2