Reputation: 97
I have a 3D numpy array like this:
>>> a
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
I want to remove only those 2D blocks (the entries along the first axis) that contain duplicate rows within themselves. For instance, the output should look like this:
>>> remove_row_duplicates(a)
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
This is the function that I am using:
import numpy as np

def remove_row_duplicates(a):
    delindices = np.empty(0, dtype=int)
    for i in range(len(a)):
        # Unique rows of block i, after rounding to absorb float noise
        _, indices = np.unique(np.around(a[i], decimals=10), axis=0, return_index=True)
        if len(indices) < len(a[i]):
            delindices = np.append(delindices, i)
    return np.delete(a, delindices, 0)
This works perfectly, but my real array has a shape like (1000000, 7, 3). The for loop is pretty slow in Python, so this takes a lot of time. Also, my original array contains floating-point numbers. Does anyone have a better solution, or can someone help me vectorize this function?
Upvotes: 3
Views: 78
Reputation: 221614
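Sort each 2D block along axis=1, compare each row with its successor to find matching rows, and finally check for any match along that same axis=1. Note that np.sort with axis=1 sorts every column on its own rather than reordering whole rows, so this relies on the rows of a block sorting the same way in every column, as they do in the sample below: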
b = np.sort(a,axis=1)
out = a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Sample run with explanation
Input array :
In [51]: a
Out[51]:
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
Code steps :
# Sort along axis=1, i.e. down the rows of each 2D block (each column is sorted separately)
In [52]: b = np.sort(a,axis=1)
In [53]: b
Out[53]:
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [6, 7, 8],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
In [54]: (b[:,1:] == b[:,:-1]).all(-1) # Look for successive matching rows
Out[54]:
array([[ True, False],
       [False,  True],
       [False, False]])
# Look for matches along each row, which indicates the presence
# of duplicate rows within each 2D block of the original 3D array
In [55]: ((b[:,1:] == b[:,:-1]).all(-1)).any(1)
Out[55]: array([ True, True, False])
# Invert those as we need to remove those cases
# Finally index with boolean indexing and get the output
In [57]: a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Out[57]:
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
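The same sort-then-compare idea can be sketched with a true lexicographic row sort, so that equal rows are guaranteed to end up next to each other. This is only a sketch under stated assumptions: the function name drop_blocks_with_duplicate_rows and the decimals parameter are hypothetical, and the rounding simply mirrors the np.around call from the question for floating-point input.

```python
import numpy as np

def drop_blocks_with_duplicate_rows(a, decimals=None):
    """Drop 2D blocks of a 3D array that contain at least one repeated row.

    Hypothetical helper: the name and the `decimals` parameter are
    illustrative assumptions, not part of the original answer.
    """
    # Optional rounding, mirroring np.around in the question (for floats).
    r = a if decimals is None else np.around(a, decimals=decimals)

    # Lexicographic row sort inside every block: stable argsort on each
    # column, from the last (least significant) to the first (most significant).
    b = r
    for col in range(r.shape[2] - 1, -1, -1):
        order = np.argsort(b[:, :, col], axis=1, kind="stable")
        b = np.take_along_axis(b, order[:, :, None], axis=1)

    # After a true row sort, equal rows sit next to each other.
    has_dup = (b[:, 1:] == b[:, :-1]).all(axis=-1).any(axis=1)
    return a[~has_dup]
```

The loop runs once per column (3 times for an array of shape (1000000, 7, 3)), while every step inside it is vectorized over all blocks. Rounding is applied only for the comparison; the returned blocks keep their original values.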
Upvotes: 2
Reputation: 107337
You can probably do this easily using broadcasting, but since you're dealing with arrays of more than two dimensions it won't be as optimized as you expect, and in some cases it can even be very slow. Instead, you can use the following approach, inspired by Jaime's answer:
In [28]: u = np.unique(arr.view(np.dtype((np.void, arr.dtype.itemsize*arr.shape[2])))).view(arr.dtype).reshape(-1, arr.shape[2])
# Keep blocks whose rows all match u; this assumes u has exactly as many rows as each block
In [29]: inds = np.where((arr == u).all(2).sum(1) == arr.shape[1])
In [30]: arr[inds]
Out[30]:
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
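For comparison, the broadcasting approach mentioned at the start of this answer could look roughly like the sketch below. It compares every pair of rows inside each block and processes the blocks in chunks to bound memory use. The function name keep_blocks_with_unique_rows and the decimals and chunk parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def keep_blocks_with_unique_rows(arr, decimals=None, chunk=100_000):
    """Keep only the 2D blocks of a 3D array whose rows are all distinct.

    Hypothetical helper: name, `decimals` and `chunk` are illustrative
    assumptions, not taken from the original answer.
    """
    # Optional rounding so nearly equal floats compare as equal.
    r = arr if decimals is None else np.around(arr, decimals=decimals)

    # Index pairs (i, j) with i < j: each unordered pair of rows once.
    i, j = np.triu_indices(arr.shape[1], k=1)

    keep = np.empty(len(arr), dtype=bool)
    for start in range(0, len(arr), chunk):
        part = r[start:start + chunk]
        # Broadcast to shape (chunk, rows, rows, cols), then reduce the
        # column axis: eq[n, p, q] is True when rows p and q of block n match.
        eq = (part[:, :, None, :] == part[:, None, :, :]).all(axis=-1)
        # A block is kept when no pair of distinct rows matches.
        keep[start:start + chunk] = ~eq[:, i, j].any(axis=1)
    return arr[keep]
```

With 7 rows per block the intermediate comparison holds 7 x 7 x 3 booleans per block, which is why the sketch works through the array in chunks instead of broadcasting over all blocks at once.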
Upvotes: 1