Malik

Reputation: 97

Remove only the rows of a 3D NumPy array that contain duplicates within that row

I have a 3D numpy array like this:

>>> a
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],
       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],
       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

I want to remove only those rows (the 2D blocks along the first axis) which contain duplicate entries within themselves. For instance, the output should look like this:

>>> remove_row_duplicates(a)
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

This is the function that I am using:

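import numpy as np

def remove_row_duplicates(a):
    delindices = np.empty(0, dtype=int)
    for i in range(len(a)):
        # Round to absorb floating-point noise, then find the unique rows of block i
        _, indices = np.unique(np.around(a[i], decimals=10), axis=0, return_index=True)
        # Fewer unique rows than rows means block i contains a duplicate
        if len(indices) < len(a[i]):
            delindices = np.append(delindices, i)
    return np.delete(a, delindices, 0)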

This works perfectly, but my array now has a shape like (1000000, 7, 3). The for loop is pretty slow in Python, so this takes a lot of time. Also, my original array contains floating-point numbers. Does anyone have a better solution, or can anyone help me vectorize this function?

Upvotes: 3

Views: 78

Answers (2)

Divakar

Reputation: 221614

Sort each 2D block along axis=1, compare successive rows for equality, and then check whether any such match exists along that same axis:

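import numpy as np

b = np.sort(a, axis=1)  # sort each 2D block along axis=1
out = a[~((b[:, 1:] == b[:, :-1]).all(-1)).any(1)]  # keep blocks with no equal successive rows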

Sample run with explanation

Input array :

In [51]: a
Out[51]: 
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

Code steps :

# Sort along axis=1, i.e. along the rows of each 2D block
In [52]: b = np.sort(a,axis=1)

In [53]: b
Out[53]: 
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [6, 7, 8],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

In [54]: (b[:,1:] == b[:,:-1]).all(-1) # Look for successive matching rows
Out[54]: 
array([[ True, False],
       [False,  True],
       [False, False]])

# Look for any match along each row, which indicates the presence
# of duplicate rows within each 2D block of the original 3D array
In [55]: ((b[:,1:] == b[:,:-1]).all(-1)).any(1)
Out[55]: array([ True,  True, False])

# Invert those as we need to remove those cases
# Finally index with boolean indexing and get the output
In [57]: a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Out[57]: 
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
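
One caveat worth checking on real data: `np.sort(a, axis=1)` sorts every column of a block independently, so rows are not kept together as units. That is harmless for the sample above, but it can pair up values from different rows. The following is a minimal sketch of a variant that sorts whole rows lexicographically with `np.lexsort` instead. The function name `remove_blocks_with_duplicate_rows` and the optional `decimals` parameter (mirroring the `np.around` call in the question) are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def remove_blocks_with_duplicate_rows(a, decimals=None):
    """Drop every 2D block of `a` (shape (N, R, C)) that contains two equal rows."""
    a = np.asarray(a)
    # Optional rounding for float data, as done with np.around in the question.
    key = a if decimals is None else np.around(a, decimals=decimals)
    # np.lexsort uses the LAST key as the primary one, so pass columns in reverse.
    cols = tuple(key[:, :, c] for c in range(key.shape[2] - 1, -1, -1))
    order = np.lexsort(cols, axis=-1)  # shape (N, R): row order within each block
    # Reorder the rows of each block as whole units.
    b = np.take_along_axis(key, order[:, :, None], axis=1)
    # After a lexicographic sort, equal rows are adjacent.
    has_dup = (b[:, 1:] == b[:, :-1]).all(-1).any(1)
    return a[~has_dup]

# Hypothetical input where column-wise sorting would mix rows:
tricky = np.array([[[0, 1], [1, 0], [0, 0]],    # all rows distinct
                   [[0, 5], [0, 5], [1, 0]]])   # first two rows equal
kept = remove_blocks_with_duplicate_rows(tricky)
```

Here `kept` contains only the first block of `tricky`, since the second one has a repeated row. For float input, passing something like `decimals=10` applies the same rounding the question used before comparing.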

Upvotes: 2

Kasravnd

Reputation: 107337

You can probably do this easily using broadcasting, but since you're dealing with arrays of more than two dimensions it won't be as optimized as you expect, and in some cases it can even be very slow. Instead, you can use the following approach inspired by Jaime's answer (`arr` is the input array):

In [28]: u = np.unique(arr.view(np.dtype((np.void, arr.dtype.itemsize*arr.shape[1])))).view(arr.dtype).reshape(-1, arr.shape[1])

In [29]: inds = np.where((arr == u).all(2).sum(0) == u.shape[1])

In [30]: arr[inds]
Out[30]: 
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
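
For comparison, the broadcasting approach mentioned at the start of this answer could look roughly like the sketch below. It compares every pair of rows inside each block directly, so it does not depend on the array-wide unique rows `u`, which the snippet above appears to rely on matching the sample's layout. The function name `blocks_without_duplicate_rows` is hypothetical.

```python
import numpy as np

def blocks_without_duplicate_rows(a):
    """Keep only the 2D blocks of `a` (shape (N, R, C)) whose rows are all distinct."""
    a = np.asarray(a)
    # eq[n, i, j] is True when rows i and j of block n are identical.
    eq = (a[:, :, None, :] == a[:, None, :, :]).all(-1)
    # Only pairs with i < j matter; the diagonal (i == j) is always True.
    i, j = np.triu_indices(a.shape[1], k=1)
    has_dup = eq[:, i, j].any(1)
    return a[~has_dup]

a = np.array([[[0, 1, 2], [0, 1, 2], [6, 7, 8]],
              [[6, 7, 8], [0, 1, 2], [6, 7, 8]],
              [[0, 1, 2], [3, 4, 5], [6, 7, 8]]])
result = blocks_without_duplicate_rows(a)
```

The trade-off is memory: the intermediate comparison has N * R * R * C boolean elements, which for a (1000000, 7, 3) array is about 147 million, so processing the first axis in chunks may be necessary. Float data would need rounding first (as with `np.around` in the question) for near-equal rows to compare as equal.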

Upvotes: 1
