Reputation: 17359
I have a scipy sparse CSR matrix of size 2M x 50k with 200M non-zero values (100 per row). I need to slice 120k rows of it by a (randomly distributed) index (which is a pandas Series) and then multiply that submatrix by a sparse vector of size 1x50k (with 100 non-zero values as well).
I do this:
slice = matrix[index.tolist(), :]
result = slice.dot(vector.T).T.toarray()[0] # returns 1x120k array
The slicing takes 0.7s (slow) and the multiplication then takes 0.05s.
Instead, I can multiply the whole matrix first and then slice the result:
result = matrix.dot(vector.T).T.toarray()[0]
result_sliced = result[index.tolist()] # returns 1x120k array
In this case, the multiplication takes 0.65s and the slicing takes 0.015s.
Questions:
Why is slicing a CSR matrix by rows so slow? Even multiplying the whole matrix takes less time than the slicing alone.
Is there a way to achieve the end result faster?
Upvotes: 0
Views: 2619
Reputation: 21
If you have enough RAM, you can convert the matrix to LIL format with tolil(), and slicing should be faster.
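A minimal sketch of that idea, with made-up sizes much smaller than in the question. Note that the tolil() conversion itself costs time and memory, so it only pays off if the LIL matrix is reused for many slices:

import numpy as np
import scipy.sparse as sp

# Hypothetical, much smaller stand-ins for the 2M x 50k matrix and 120k index
matrix = sp.random(20000, 500, density=0.002, format='csr')
index = np.random.choice(20000, 1200, replace=False)

# Convert once to LIL, which stores each row as a Python list and
# supports cheaper row extraction than CSR fancy indexing.
lil = matrix.tolil()
sub = lil[index, :]        # row slice in LIL format
sub_csr = sub.tocsr()      # back to CSR for the fast dot product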
Upvotes: 0
Reputation: 1211
I ran into the same issue, and my solution was to write a row extractor that relies on numpy array indexing rather than sparse matrix multiplication. See my approach here.
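A rough sketch of that kind of extractor, building the submatrix directly from the CSR indptr/indices/data arrays (take_rows_csr is a made-up name; the linked approach may differ in detail):

import numpy as np
import scipy.sparse as sp

def take_rows_csr(m, row_idx):
    """Return m[row_idx, :] by indexing the CSR arrays directly,
    avoiding the extractor-matrix multiplication scipy does internally."""
    row_idx = np.asarray(row_idx)
    starts = m.indptr[row_idx]                 # start of each selected row in m.data
    counts = m.indptr[row_idx + 1] - starts    # number of stored values per row
    new_indptr = np.concatenate(([0], np.cumsum(counts)))
    # absolute positions of every stored value of the selected rows
    pos = (np.repeat(starts, counts)
           + np.arange(new_indptr[-1])
           - np.repeat(new_indptr[:-1], counts))
    return sp.csr_matrix((m.data[pos], m.indices[pos], new_indptr),
                         shape=(len(row_idx), m.shape[1]))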
Upvotes: 0
Reputation: 231665
I explained in Sparse matrix slicing using list of int that this kind of row indexing is actually performed with matrix multiplication. In effect it constructs a sparse extractor with a 1 at each of the desired rows and does the appropriate dot product.
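A small demonstration of that equivalence (sizes here are arbitrary): row indexing of a CSR matrix gives the same result as multiplying by an extractor matrix that has a single 1 per row, placed at the wanted row position.

import numpy as np
import scipy.sparse as sp

matrix = sp.random(1000, 50, density=0.05, format='csr')
rows = np.array([3, 17, 4, 999])

# Extractor: one row per selected row, with a 1 in the column of that row index
extractor = sp.csr_matrix(
    (np.ones(len(rows)), (np.arange(len(rows)), rows)),
    shape=(len(rows), matrix.shape[0]))

# Both produce the same submatrix
assert (extractor @ matrix != matrix[rows, :]).nnz == 0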
So I'm not surprised that the order of the operations doesn't matter much.
In general, sparse matrices are not designed for efficient indexing; they don't, for example, return views. Matrix multiplication is one of the most efficient csr operations. Even row or column sums are performed with matrix multiplication.
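For instance, a csr row sum amounts to a product with a vector of ones; a quick check (arbitrary sizes) of that claim:

import numpy as np
import scipy.sparse as sp

m = sp.random(1000, 200, density=0.02, format='csr')
ones = np.ones(m.shape[1])

# sum(axis=1) and a matrix-vector product with ones give the same result
assert np.allclose(np.asarray(m.sum(axis=1)).ravel(), m @ ones)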
Upvotes: 1