Reputation: 7379
Given a 2-D NumPy array, how can I retain the N smallest elements in each row and set the rest of them to 0 (zero)?
For example: N=3
Input array:
1 2 3 4 5
4 3 6 1 0
6 5 3 1 2
Expected output:
1 2 3 0 0
0 3 0 1 0
0 0 3 1 2
Following is the code that I have tried and it works:
# distance_matrix is the given 2D array
N = 3
for i in range(distance_matrix.shape[0]):
    # value at sorted position N, i.e. the (N+1)-th smallest in this row
    threshold = np.sort(distance_matrix[i])[N]
    for j in range(distance_matrix.shape[1]):
        # keep entries strictly below the threshold, zero out the rest
        distance_matrix[i][j] = np.where(distance_matrix[i][j] < threshold, distance_matrix[i][j], 0)
# return distance_matrix
However, this operation involves iterating over every single element. Is there a faster way to solve this using np.argsort() or any other function?
Upvotes: 4
Views: 739
Reputation: 33950
If you can use pandas and have your input in a DataFrame, this is a one-liner with .apply(..., axis=1) on each row:
df.apply(lambda row: row.nsmallest(3), axis=1).fillna(0).astype(int)
   0  1  2  3  4
0  1  2  3  0  0
1  0  3  0  1  0
2  0  0  3  1  2
Notes:
- nsmallest() has a keep argument specifying how to handle duplicate values (ties).
- .astype(int) is needed, since introducing the NaNs causes the dtype to be coerced up to float. You could avoid that if you wrote a custom function to replace with zeros.

And here's the boilerplate to make your example reproducible:
import pandas as pd
from io import StringIO
dat = """1 2 3 4 5
4 3 6 1 0
6 5 3 1 2"""
df = pd.read_csv(StringIO(dat), sep=r'\s+', header=None)
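As a rough sketch of the "custom function" idea mentioned in the notes: the helper below (keep_nsmallest is a made-up name, not a pandas function) masks each row with Series.where instead of letting apply introduce NaNs, so the integer dtype should be preserved without a trailing .astype(int). The DataFrame is rebuilt inline here only to keep the snippet self-contained.

```python
import pandas as pd

def keep_nsmallest(row, n=3):
    # Column labels of the n smallest values in this row
    keep_labels = row.nsmallest(n).index
    # Keep those entries and replace everything else with 0 (no NaNs involved)
    return row.where(row.index.isin(keep_labels), 0)

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [4, 3, 6, 1, 0],
                   [6, 5, 3, 1, 2]])

out = df.apply(keep_nsmallest, axis=1)
print(out)
```

This is still a row-by-row apply, so it is unlikely to be faster than a pure NumPy solution on large inputs.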
Upvotes: 1
Reputation: 221614
Approach #1
Here's one with np.argpartition for performance efficiency -
N = 3
newval = 0
np.put_along_axis(a,np.argpartition(a,N,axis=1)[:,N:],newval,axis=1)
Explanation: with kth=N, np.argpartition returns, for each row, indices arranged into two partitions: the first N positions point at the N smallest elements along that axis (in no guaranteed order) and the remaining positions point at the rest. We need to reset the second partition, which we select with [:,N:], and we use np.put_along_axis to do the resetting. Note that np.put_along_axis modifies the array in place.
Sample run -
In [144]: a # input array
Out[144]:
array([[1, 2, 3, 4, 5],
[4, 3, 6, 1, 0],
[6, 5, 3, 1, 2]])
In [145]: np.put_along_axis(a,np.argpartition(a,3,axis=1)[:,3:],0,axis=1)
In [146]: a
Out[146]:
array([[1, 2, 3, 0, 0],
[0, 3, 0, 1, 0],
[0, 0, 3, 1, 2]])
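Since np.put_along_axis writes into the array it is given, here is a minimal sketch of a wrapper that works on a copy instead. The function name keep_n_smallest and its fill parameter are illustrative choices, not part of NumPy, and the guard for n >= number of columns is an added assumption about the desired behaviour (np.argpartition needs kth to be smaller than the axis length).

```python
import numpy as np

def keep_n_smallest(a, n, fill=0):
    """Return a copy of `a` where all but the n smallest entries per row are set to `fill`."""
    out = np.array(a, copy=True)
    if n >= out.shape[1]:
        # Nothing to reset: every element is among the n smallest
        return out
    # Column indices of everything outside the n smallest, per row
    drop_idx = np.argpartition(out, n, axis=1)[:, n:]
    np.put_along_axis(out, drop_idx, fill, axis=1)
    return out

a = np.array([[1, 2, 3, 4, 5],
              [4, 3, 6, 1, 0],
              [6, 5, 3, 1, 2]])
result = keep_n_smallest(a, 3)
print(result)
```

The input array a is left untouched, which can be convenient if the original distances are still needed afterwards.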
Approach #2
Here's another again with np.argpartition, but just picking the value at sorted position N per row (that is, the (N+1)-th smallest) and then resetting everything greater than or equal to it. As such, if the Nth smallest value is tied with that cutoff value, all of the tied copies are reset, so fewer than N elements can remain in that row. Here's the implementation -
a[a>=a[np.arange(len(a)), np.argpartition(a,3,axis=1)[:,3],None]] = 0
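To see how the two approaches treat ties at the cutoff, here is a small check on a made-up row in which the 3rd and 4th smallest values are equal. The row and the variable names a1, a2 and threshold are only for illustration.

```python
import numpy as np

N = 3
row = np.array([[1, 2, 3, 3, 5]])  # the 3rd and 4th smallest values are tied

# Approach #1: resets exactly (number of columns - N) entries per row
a1 = row.copy()
np.put_along_axis(a1, np.argpartition(a1, N, axis=1)[:, N:], 0, axis=1)

# Approach #2: resets everything >= the value at sorted position N
a2 = row.copy()
threshold = a2[np.arange(len(a2)), np.argpartition(a2, N, axis=1)[:, N], None]
a2[a2 >= threshold] = 0

print(np.count_nonzero(a1, axis=1))  # how many entries approach #1 kept
print(a2)                            # approach #2 resets both tied 3s
```

Approach #1 always keeps exactly N entries per row (which of the tied values survives is not specified), whereas approach #2 drops every copy of the cutoff value.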
Timings on a scaled up version -
In [184]: a = np.array([[1,2,3,4,5],[4,3,6,1,0],[6,5,3,1,2]])
In [185]: a = np.repeat(a,10000,axis=0)
In [186]: %timeit np.put_along_axis(a,np.argpartition(a,3,axis=1)[:,3:],0,axis=1)
1.78 ms ± 5.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [187]: a = np.array([[1,2,3,4,5],[4,3,6,1,0],[6,5,3,1,2]])
In [188]: a = np.repeat(a,10000,axis=0)
In [189]: %timeit a[a>=a[np.arange(len(a)), np.argpartition(a,3,axis=1)[:,3],None]] = 0
1.54 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 5