Reputation: 2317
I have a large dataset in a numpy.ndarray similar to this:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
I would like to remove duplicate rows, where two rows count as duplicates if their 0th, 1st, 2nd and 5th columns are equal. For the matrix above, the result would be:
-4, 5, 9, 30, 50, 80
2, -6, 9, 34, 12, 7
5, -9, 0, 32, 18, 0
numpy.unique does something very similar, but it only finds duplicates across all columns of a row. I only want to compare specific columns. How would one do this with numpy? I could not find a suitable numpy function for it. Is there a better module?
Upvotes: 3
Views: 2255
Reputation: 180
You can use the np.take function (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.take.html) to get only the columns of the array that you care about, and then call np.unique on the result with axis=0 and return_index=True.
>>> import numpy as np
>>> arr = np.array([[ -4, 5, 9, 30, 50, 80],
... [ 2, -6, 9, 34, 12, 7],
... [ -4, 5, 9, 98, -21, 80],
... [ 5, -9, 0, 32, 18, 0]])
>>> relevant_columns = np.take(arr, [0,1,2,5], axis=1)
>>> np.unique(relevant_columns, axis=0, return_index=True)
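(array([[-4, 5, 9, 80],
[ 2, -6, 9, 7],
[ 5, -9, 0, 0]]), array([0, 1, 3]))
The second element of the returned tuple holds, for each unique combination, the index of the first row that contains it. You can then use np.take() again on your original numpy array: pass that index array as the indices, together with axis=0 so that whole rows are selected rather than elements of the flattened array.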
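Putting the column selection, np.unique and the final row selection together, a minimal sketch could look like the following. The function name drop_duplicate_rows and its parameter names are made up for illustration; only np.take and np.unique come from NumPy.

```python
import numpy as np

def drop_duplicate_rows(arr, key_columns):
    """Keep the first row for each distinct combination of values in key_columns.

    Hypothetical helper, not part of NumPy.
    """
    # Select only the columns that decide whether two rows are duplicates.
    keys = np.take(arr, key_columns, axis=1)
    # first_idx[i] is the index of the first row holding the i-th unique key.
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    # axis=0 selects whole rows; without it np.take indexes the flattened array.
    return np.take(arr, first_idx, axis=0)

arr = np.array([[-4, 5, 9, 30, 50, 80],
                [2, -6, 9, 34, 12, 7],
                [-4, 5, 9, 98, -21, 80],
                [5, -9, 0, 32, 18, 0]])

result = drop_duplicate_rows(arr, [0, 1, 2, 5])
print(result)
```

The rows come back ordered by their key columns, which is not necessarily the order they had in arr.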
Upvotes: 0
Reputation: 221614
Use np.unique on the sliced array with return_index=True and axis=0. With axis=0 each row is treated as one entity, so the returned indices point to the first occurrence of each unique row. These indices can then be used for row-indexing into the original array to get the desired output.
So, with a as the input array, it would be -
a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]
Sample run to break down the steps and hopefully make things clear -
In [29]: a
Out[29]:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
In [30]: a_slice = a[:,[0,1,2,5]]
In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)
In [32]: final_output = a[unq_row_indices]
In [33]: final_output
Out[33]:
array([[-4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ 5, -9, 0, 32, 18, 0]])
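One thing worth keeping in mind: np.unique returns the indices ordered by the sorted keys, so the surviving rows are not guaranteed to keep their original order. In the sample above the two orders happen to coincide. If input order matters, a small sketch (with a reordered, made-up input array) is to sort the indices before indexing:

```python
import numpy as np

# Hypothetical input where the first row does not have the smallest key.
a = np.array([[5, -9, 0, 32, 18, 0],
              [-4, 5, 9, 30, 50, 80],
              [5, -9, 0, 11, 22, 0],
              [2, -6, 9, 34, 12, 7]])

_, first_idx = np.unique(a[:, [0, 1, 2, 5]], axis=0, return_index=True)

# Rows ordered by key, as np.unique returns them.
by_key = a[first_idx]

# Sorting the first-occurrence indices restores the input order.
in_input_order = a[np.sort(first_idx)]
print(in_input_order)
```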
Upvotes: 5
Reputation: 164773
Pandas has functionality for this via pd.DataFrame.drop_duplicates. However, the convenient syntax comes at the cost of performance.
import pandas as pd
import numpy as np
A = np.array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
res = pd.DataFrame(A)\
.drop_duplicates(subset=[0, 1, 2, 5])\
.values
print(res)
# [[-4  5  9 30 50 80]
#  [ 2 -6  9 34 12  7]
#  [ 5 -9  0 32 18  0]]
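To check the performance difference on your own data, a rough sketch along these lines may help. The random test array, the function names with_pandas and with_numpy, and the repeat count are all assumptions for illustration; timings depend on array size, dtype and library versions, so treat the printed numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: many rows, few distinct values, so duplicates are common.
rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(10_000, 6))
cols = [0, 1, 2, 5]

def with_pandas(arr):
    # drop_duplicates keeps the first occurrence and preserves row order.
    return pd.DataFrame(arr).drop_duplicates(subset=cols).to_numpy()

def with_numpy(arr):
    _, idx = np.unique(arr[:, cols], axis=0, return_index=True)
    # Sort the indices so the result is in input order, like the pandas version.
    return arr[np.sort(idx)]

# Both approaches should select exactly the same rows.
same = np.array_equal(with_pandas(A), with_numpy(A))
print("same result:", same)

print("pandas:", timeit.timeit(lambda: with_pandas(A), number=20))
print("numpy: ", timeit.timeit(lambda: with_numpy(A), number=20))
```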
Upvotes: 2