Reputation: 1938
I have a scipy sparse matrix in CSR format. It's 72665x72665 so it's impractical to convert this matrix to a dense matrix to perform operations on (the dense representation of this matrix is like 40 gigs). The matrix is symmetric, and has about 82 million non-zero entries (~1.5%).
What I would like to do is get, for each row, the indices of the largest N values. If this were a numpy array, I would use np.argpartition like so:
for row in matrix:
    top_n_idx = np.argpartition(row, -n)[-n:]
Is there something similar to this I can do for a sparse matrix?
Upvotes: 4
Views: 2720
Reputation: 950
Quick and dirty:
Find top n values:
np.sort(matrix[i,:].data)[-n:]
Find indices for top n values:
matrix[i,:].nonzero()[1][np.argsort(matrix[i,:].data)[-n:]]
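Put together, a minimal runnable sketch of this approach (the matrix values and the choices of n and i are made up for illustration). Note that .data contains only the stored entries, so implicit zeros are never candidates:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small example matrix (hypothetical values).
matrix = csr_matrix(np.array([
    [0, 5, 0, 2, 9],
    [1, 0, 0, 0, 0],
    [0, 3, 4, 0, 8],
]))
n = 2
i = 0

# Top n stored values of row i, in ascending order.
top_vals = np.sort(matrix[i, :].data)[-n:]

# Column indices of those top n values.
top_idx = matrix[i, :].nonzero()[1][np.argsort(matrix[i, :].data)[-n:]]
```

For row 0 this picks the values 5 and 9 at columns 1 and 4. The full sort is O(k log k) per row (k = stored entries in the row), which is why the other answers switch to np.argpartition.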
Upvotes: 0
Reputation: 3831
An improvement on @Paul Panzer's solution: it now handles the case where a row has fewer than n stored values.
def top_n_idx_sparse(matrix, n):
    """Return the indices of the top n values in each row of a CSR matrix."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        n_row_pick = min(n, ri - le)
        top_n_idx.append(
            matrix.indices[
                le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
            ]
        )
    return top_n_idx
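For completeness, here is a self-contained version with a small usage example (the matrix values are made up; note the second row has only one stored value, which would break the plain argpartition loop):

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_idx_sparse(matrix, n):
    """Return the indices of the top n values in each row of a CSR matrix."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        # Never ask argpartition for more elements than the row stores.
        n_row_pick = min(n, ri - le)
        top_n_idx.append(
            matrix.indices[
                le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
            ]
        )
    return top_n_idx

m = csr_matrix(np.array([
    [0, 5, 0, 2, 9],
    [1, 0, 0, 0, 0],   # only one stored value in this row
    [0, 3, 4, 0, 8],
]))
result = top_n_idx_sparse(m, 2)
```

Each element of the result is an array of column indices; rows with fewer than n stored values simply return all of them.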
What does it do?
matrix.indptr gives the offset at which each row's values begin in the data array, so (le, ri) is the range of data indices covering the non-zero values of one row. matrix.data[le:ri] is therefore the non-zero values of that row. Passing it to np.argpartition(..., -n_row_pick) rearranges the local indices so that the largest n_row_pick elements end up at the back, and [-n_row_pick:] selects those local indices. Adding le shifts the local indices back into positions in the full data array. Finally, indexing matrix.indices with those positions yields the column indices of the largest n values of the row.
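The CSR internals this walkthrough refers to can be inspected directly on a tiny matrix (values made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([
    [0, 5, 0, 4],
    [0, 0, 0, 0],
    [7, 0, 1, 0],
]))

print(m.indptr)   # row boundaries into data/indices: [0 2 2 4]
print(m.indices)  # column index of each stored value: [1 3 0 2]
print(m.data)     # the stored values themselves:      [5 4 7 1]

# Row 2 spans data[indptr[2]:indptr[3]]:
le, ri = m.indptr[2], m.indptr[3]
```

Here data[le:ri] is [7, 1] and indices[le:ri] is [0, 2], i.e. row 2 has value 7 in column 0 and value 1 in column 2. The empty row 1 shows up as a zero-length range (indptr[1] == indptr[2]).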
Upvotes: 7
Reputation: 53119
Directly using the CSR format, and assuming each row has enough positive nonzeros, you can write:
for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
    top_n_idx = matrix.indices[le + np.argpartition(matrix.data[le:ri], -n)[-n:]]
Upvotes: 2