Remove rows by duplicate column(s) values

I have a large dataset in a numpy.ndarray similar to this:

array([[ -4,   5,   9,  30,  50,  80],
       [  2,  -6,   9,  34,  12,   7],
       [ -4,   5,   9,  98, -21,  80],
       [  5,  -9,   0,  32,  18,   0]])

I would like to remove rows that are duplicates with respect to the 0th, 1st, 2nd and 5th columns only. For the above matrix, the result would be:

-4, 5, 9, 30, 50, 80
2, -6, 9, 34, 12, 7
5, -9, 0, 32, 18, 0

numpy.unique does something very similar, but it only finds duplicates across all columns (along an axis). I only want to compare specific columns. How can I do this with numpy? I could not find a suitable numpy function for it. Is there a better module?

Upvotes: 3

Answers (3)

sbrannon

Reputation: 180

You can use np.take (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.take.html) to extract only the columns you care about, and then call np.unique on them with axis=0 and return_index=True.

>>> arr = np.array([[ -4,   5,   9,  30,  50,  80],
...        [  2,  -6,   9,  34,  12,   7],
...        [ -4,   5,   9,  98, -21,  80],
...        [  5,  -9,   0,  32,  18,   0]])
>>> relevant_columns = np.take(arr, [0,1,2,5], axis=1)
>>> np.unique(relevant_columns, axis=0, return_index=True)
(array([[ 2, -6,  9,  7],
       [ 5, -9,  0,  0],
       [-4,  5,  9, 80]]), array([1, 3, 0]))

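You can then use np.take() again on your original numpy array. Pass the returned index array (array([1, 3, 0]) here) as the indices, together with axis=0 so that whole rows are selected; without axis, np.take indexes into the flattened array.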
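A minimal sketch of that last step, putting the pieces together. The names idx and result are only illustrative. The rows come back in whatever order np.unique sorted the unique rows, so they may not be in their original order.

```python
import numpy as np

arr = np.array([[-4,  5, 9, 30,  50, 80],
                [ 2, -6, 9, 34,  12,  7],
                [-4,  5, 9, 98, -21, 80],
                [ 5, -9, 0, 32,  18,  0]])

# Keep only the columns that define a duplicate
relevant_columns = np.take(arr, [0, 1, 2, 5], axis=1)

# idx holds the index of the first occurrence of each unique row
_, idx = np.unique(relevant_columns, axis=0, return_index=True)

# axis=0 selects whole rows of the original array
result = np.take(arr, idx, axis=0)
```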

Upvotes: 0

Divakar

Reputation: 221614

Use np.unique on the column-sliced array with return_index=True and axis=0. This returns the index of the first occurrence of each unique row, treating each row as one entity. These indices can then be used to index rows of the original array for the desired output.

So, with a as the input array, it would be -

a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]

Sample run to break down the steps and hopefully make things clear -

In [29]: a
Out[29]: 
array([[ -4,   5,   9,  30,  50,  80],
       [  2,  -6,   9,  34,  12,   7],
       [ -4,   5,   9,  98, -21,  80],
       [  5,  -9,   0,  32,  18,   0]])

In [30]: a_slice = a[:,[0,1,2,5]]

In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)

In [32]: final_output = a[unq_row_indices]

In [33]: final_output
Out[33]: 
array([[-4,  5,  9, 30, 50, 80],
       [ 2, -6,  9, 34, 12,  7],
       [ 5, -9,  0, 32, 18,  0]])
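np.unique returns the indices in the sorted order of the unique rows, not in order of appearance. If the original row order must be preserved, one option is to sort the indices before indexing. A small sketch of this, wrapped in a helper whose name (drop_duplicate_rows) is hypothetical:

```python
import numpy as np

def drop_duplicate_rows(a, cols):
    """Return rows of `a` that are unique with respect to `cols`,
    keeping the first occurrence and the original row order."""
    _, first_idx = np.unique(a[:, cols], axis=0, return_index=True)
    # Sorting the first-occurrence indices restores the original order
    return a[np.sort(first_idx)]

a = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])

out = drop_duplicate_rows(a, [0, 1, 2, 5])
```

Here out contains rows 0, 1 and 3 of a, in that order.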

Upvotes: 5

jpp

Reputation: 164773

Pandas has functionality for this via pd.DataFrame.drop_duplicates. However, the convenient syntax comes at the cost of performance.

import pandas as pd
import numpy as np

A = np.array([[ -4,   5,   9,  30,  50,  80],
              [  2,  -6,   9,  34,  12,   7],
              [ -4,   5,   9,  98, -21,  80],
              [  5,  -9,   0,  32,  18,   0]])

res = pd.DataFrame(A)\
        .drop_duplicates(subset=[0, 1, 2, 5])\
        .values

print(res)

[[-4  5  9 30 50 80]
 [ 2 -6  9 34 12  7]
 [ 5 -9  0 32 18  0]]
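By default drop_duplicates keeps the first row of each duplicate group. Its keep parameter changes that. The following sketch shows the two other settings on the same data; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

A = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])

df = pd.DataFrame(A)

# keep="last": retain the last row of each duplicate group
res_last = df.drop_duplicates(subset=[0, 1, 2, 5], keep="last").to_numpy()

# keep=False: drop every row that has a duplicate
res_none = df.drop_duplicates(subset=[0, 1, 2, 5], keep=False).to_numpy()
```

With keep="last", the row starting with 98 in column 3 survives instead of the one with 30; with keep=False, both of those rows are removed.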

Upvotes: 2
