Reputation: 1938

Finding the top n values in a row of a scipy sparse matrix

I have a scipy sparse matrix in CSR format. It's 72665x72665 so it's impractical to convert this matrix to a dense matrix to perform operations on (the dense representation of this matrix is like 40 gigs). The matrix is symmetric, and has about 82 million non-zero entries (~1.5%).

What I would like to be able to do is, for each row, I want to get the indices of the largest N values. If this were a numpy array, I would use np.argpartition to do it like so:

    for row in matrix:
        top_n_idx = np.argpartition(row,-n)[-n:]

Is there something similar to this I can do for a sparse matrix?

Upvotes: 4

Answers (3)

Radio Controlled

Reputation: 950

Quick and dirty:

Find top n values:

np.sort(matrix[i,:].data)[-n:]

Find indices for top n values:

matrix[i,:].nonzero()[1][np.argsort(matrix[i,:].data)[-n:]]

Upvotes: 0

Louis Yang

Reputation: 3831

Improve from @Paul Panzer's solution. Now it can handle the case when any row has less than n values.

def top_n_idx_sparse(matrix, n):
    """Return index of top n values in each row of a sparse matrix."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        n_row_pick = min(n, ri - le)
        top_n_idx.append(
            matrix.indices[
                le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
            ]
        )
    return top_n_idx

What it does?

The matrix.indptr gives the indices of the beginning of each row stored in the data array. So (lr, ri) is the range of data indices of non-zero values in each row. matrix.data[le:ri] gives the non-zero values of the row. Passing that to np.argpartition(..., -n_row_pick) gives you the local indices that will sort the row for the largest n_row_pick elements from the back. [-n_row_pick:] select those local indices. Then le + shift those local indices back to the indices in the data array. Finally, pass that to matrix.indices to get the largest n values indices in the matrix space.

Upvotes: 7

Paul Panzer

Reputation: 53119

Directly using the CSR format and assuming there are enough positive nonzeros in each row you can write:

for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
    top_n_idx = matrix.indices[le + np.argpartition(matrix.data[le:ri], -n)[-n:]]

Upvotes: 2

Finding the top n values in a row of a scipy sparse matrix

Answers (3)

Related Questions