Reputation: 435
I have a 2D matrix with values, and I want to find the top 5 values' indices. For example for
matrix([[0.17542851, 0.13199346, 0.01579704, 0.01429822, 0.01302919],
[0.13279703, 0.12444886, 0.04742024, 0.03114371, 0.02623729],
[0.13502306, 0.07815065, 0.07291175, 0.03690815, 0.02163695],
[0.19032505, 0.15853737, 0.05889324, 0.02791679, 0.02699252],
[0.1695696 , 0.14538635, 0.07127667, 0.04997876, 0.02580234]])
I want to get (0,3), (0,1), (0,4), (3,1), (4,1)
I searched and tried many workaround, including np.argmax(), np.argsort(), np.argpartition()
without any good results.
For example:
>>np.dstack(np.unravel_index(np.argsort(a.ravel(),axis=None), a.shape))
array([[[0, 4],
[0, 3],
[0, 2],
[2, 4],
[4, 4],
[1, 4],
[3, 4],
[3, 3],
[1, 3],
[2, 3],
[1, 2],
[4, 3],
[3, 2],
[4, 2],
[2, 2],
[2, 1],
[1, 1],
[0, 1],
[1, 0],
[2, 0],
[4, 1],
[3, 1],
[4, 0],
[0, 0],
[3, 0]]], dtype=int64)
this result makes no sense. Notice that I want the original indices, I don't care about the order (just want the top 5 in any order, ascending will be better though)
Upvotes: 4
Views: 1623
Reputation: 26311
I landed here from a question that references @Divakar (very elegant and fast) answer.
A common issue with rank is how to handle duplicates (ties).
In some cases, it is desirable to use "dense rank", in which [4, 7, 7, 9]
would be ranked (in ascending order): [0, 1, 1, 2]
.
By contrast, @Divakar's answer essentially reflects "ordinal ranking", in which [4, 7, 7, 9]
would be ranked (in ascending order) [0, 1, 2, 3]
. This can be slightly counter-intuitive in a "top k" question. For example, on:
b = np.array(
[[8, 6, 3],
[6, 7, 2],
[0, 8, 9]])
with rank k=2
and (and assuming descending order), it gives:
k = 2
>>> np.c_[np.unravel_index(np.argpartition(b.ravel(),-k)[-k:], b.shape)]
array([[2, 1],
[2, 2]])
which corresponds to the 9
and only one of the 8
values, leaving out the other 8
value.
If anyone is interested in the "dense rank", I would propose the following (which returns all indices of top k values, in "any order" --actually, in index order):
def topk_indices(a, k):
_, rix = np.unique(-a, return_inverse=True)
return np.c_[np.unravel_index(np.where(rix < k)[0], a.shape)]
On the OP's array:
>>> topk_indices(a, 5)
array([[0, 0],
[3, 0],
[3, 1],
[4, 0],
[4, 1]])
And on the int array above:
>>> topk_indices(b, 2)
array([[0, 0],
[2, 1],
[2, 2]])
Performance
In terms of performance, @Divakar's answer is between 5x and 10x faster than this, for wide-ranging number of tests of diverse dimensions and parameters. So if you don't think you have ties, or if you don't care, then use it instead of this.
As an example:
a = np.random.randint(0, 10, (1_000_000, 2))
t0 = %timeit -o topk_indices(a, 5)
# 157 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
t1 = %timeit -o divakar_topk_indices(a, 5)
# 25.1 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> t0.average / t1.average
6.24
As a side-note, it offends my sensibility to have to sort an entire array (O(n log n)
) just to find the top-k... A more logical heapq
approach scales better (O(n log k)
), but has larger constant multipliers (just heapq.nlargest(5, a.ravel())
takes 211 ms, and that's just returning the values, not the indices.
Upvotes: 0
Reputation: 221754
np.argpartition
should be a good tool (efficient one) to get those top k
indices without maintaining order. Hence, for array data a
, it would be -
In [43]: np.c_[np.unravel_index(np.argpartition(a.ravel(),-5)[-5:],a.shape)]
Out[43]:
array([[4, 1],
[3, 1],
[4, 0],
[0, 0],
[3, 0]])
To explain, let's break it down into single process steps -
# Get partitioned indices such that the last 5 indices refer to the top 5
# values taken globally from the input array. Refer to docs for more info
# Note that it's global because we will flatten it.
In [9]: np.argpartition(a.ravel(),-5)
Out[9]:
array([14, 24, 2, 3, 4, 23, 22, 7, 8, 9, 19, 18, 17, 13, 12, 11, 6,
1, 5, 10, 21, 16, 20, 0, 15])
# Get last 5 indices, which are the top 5 valued indices
In [10]: np.argpartition(a.ravel(),-5)[-5:]
Out[10]: array([21, 16, 20, 0, 15])
# Convert the global indices back to row-col format
In [11]: np.unravel_index(np.argpartition(a.ravel(),-5)[-5:],a.shape)
Out[11]: (array([4, 3, 4, 0, 3]), array([1, 1, 0, 0, 0]))
# Stack into two-columnar array
In [12]: np.c_[np.unravel_index(np.argpartition(a.ravel(),-5)[-5:],a.shape)]
Out[12]:
array([[4, 1],
[3, 1],
[4, 0],
[0, 0],
[3, 0]])
For matrix data in a
, it would be -
In [48]: np.dstack(np.unravel_index(np.argpartition(a.ravel(),-5)[:,-5:],a.shape))
Out[48]:
array([[[4, 1],
[3, 1],
[4, 0],
[0, 0],
[3, 0]]])
So, compared to the array, the only difference is with the usage of np.dstack
, because with matrix data, the data always stays as 2D.
Notice that these are your results from the last 5
rows.
Upvotes: 2
Reputation: 25269
Your sample:
n = np.array([[0.17542851, 0.13199346, 0.01579704, 0.01429822, 0.01302919],
[0.13279703, 0.12444886, 0.04742024, 0.03114371, 0.02623729],
[0.13502306, 0.07815065, 0.07291175, 0.03690815, 0.02163695],
[0.19032505, 0.15853737, 0.05889324, 0.02791679, 0.02699252],
[0.1695696 , 0.14538635, 0.07127667, 0.04997876, 0.02580234]])
Your output is not top 5 values' indice. Top 5 values are
array([0.14538635, 0.15853737, 0.1695696 , 0.17542851, 0.19032505])
To get their indices: sort
and using isin
to flag their location True
. Finally, use argwhere
to get their posistion
np.argwhere(np.isin(n, np.sort(n, axis=None)[-5:]))
Out[324]:
array([[0, 0],
[3, 0],
[3, 1],
[4, 0],
[4, 1]], dtype=int32)
Upvotes: 2
Reputation: 114035
Assuming you have a list of lists:
In [112]: M
Out[112]:
[[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]]
In [113]: heapq.nlargest(5, ((r,c) for r in range(len(M)) for c in range(len(M[r]))), key=lambda t: M[t[0]][t[1]])
Out[113]: [(4, 4), (4, 3), (4, 2), (4, 1), (4, 0)]
Don't forget to import heapq
Upvotes: 1