Reputation: 17359
I have a scipy sparse CSR matrix of size 2M x 50k with 200M non-zero values (100 per row). I need to slice 120k rows of it by a (randomly distributed) index (which is a pandas Series) and then multiply that submatrix by a sparse vector of size 1x50k (with 100 non-zero values as well).
I do this:
slice = matrix[index.tolist(), :]
result = slice.dot(vector.T).T.toarray()[0] # returns 1x120k array
The slicing takes 0.7s (slow) and the multiplication then takes 0.05s.
Instead, I can multiply the whole matrix first and then slice the result:
result = matrix.dot(vector.T).T.toarray()[0]
result_sliced = result[index.tolist()] # returns 1x120k array
In this case, the multiplication takes 0.65s and the slicing takes 0.015s.
Questions:
Why is slicing a CSR matrix by rows so slow? Even multiplying the whole matrix takes less time than the slicing alone.
Is there a way to achieve the end result faster?
Upvotes: 0
Views: 2619
Reputation: 21
If you have enough RAM, you can convert the matrix to LIL format with tolil(), and slicing should be faster.
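A minimal sketch of that idea, with made-up sizes much smaller than in the question. Note that the tolil() conversion itself costs time and memory, so it only pays off if the LIL matrix is reused for many slices:

import numpy as np
import scipy.sparse as sp

# Hypothetical, much smaller stand-ins for the 2M x 50k matrix and 120k index
matrix = sp.random(20000, 500, density=0.002, format='csr')
index = np.random.choice(20000, 1200, replace=False)

# Convert once to LIL, which stores each row as a Python list and
# supports cheaper row extraction than CSR fancy indexing.
lil = matrix.tolil()
sub = lil[index, :]        # row slice in LIL format
sub_csr = sub.tocsr()      # back to CSR for the fast dot product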
Upvotes: 0
Reputation: 1211
I ran into the same issue, and my solution was to write a row extractor that relies on numpy array indexing rather than sparse matrix multiplication. See my approach here.
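A rough sketch of that kind of extractor, building the submatrix directly from the CSR indptr/indices/data arrays (take_rows_csr is a made-up name; the linked approach may differ in detail):

import numpy as np
import scipy.sparse as sp

def take_rows_csr(m, row_idx):
    """Return m[row_idx, :] by indexing the CSR arrays directly,
    avoiding the extractor-matrix multiplication scipy does internally."""
    row_idx = np.asarray(row_idx)
    starts = m.indptr[row_idx]                 # start of each selected row in m.data
    counts = m.indptr[row_idx + 1] - starts    # number of stored values per row
    new_indptr = np.concatenate(([0], np.cumsum(counts)))
    # absolute positions of every stored value of the selected rows
    pos = (np.repeat(starts, counts)
           + np.arange(new_indptr[-1])
           - np.repeat(new_indptr[:-1], counts))
    return sp.csr_matrix((m.data[pos], m.indices[pos], new_indptr),
                         shape=(len(row_idx), m.shape[1]))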
Upvotes: 0
Reputation: 231665
I explained in Sparse matrix slicing using list of int that this kind of row indexing is actually performed with matrix multiplication. In effect it constructs a sparse extractor with a 1 at each of the desired rows and does the appropriate dot product.
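A small demonstration of that equivalence (sizes here are arbitrary): row indexing of a CSR matrix gives the same result as multiplying by an extractor matrix that has a single 1 per row, placed at the wanted row position.

import numpy as np
import scipy.sparse as sp

matrix = sp.random(1000, 50, density=0.05, format='csr')
rows = np.array([3, 17, 4, 999])

# Extractor: one row per selected row, with a 1 in the column of that row index
extractor = sp.csr_matrix(
    (np.ones(len(rows)), (np.arange(len(rows)), rows)),
    shape=(len(rows), matrix.shape[0]))

# Both produce the same submatrix
assert (extractor @ matrix != matrix[rows, :]).nnz == 0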
So I'm not surprised that the order of the operations doesn't matter much.
In general, sparse matrices are not designed for efficient indexing; they don't, for example, return views. Matrix multiplication is one of the most efficient csr operations. Even row or column sums are performed with matrix multiplication.
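For instance, a csr row sum amounts to a product with a vector of ones; a quick check (arbitrary sizes) of that claim:

import numpy as np
import scipy.sparse as sp

m = sp.random(1000, 200, density=0.02, format='csr')
ones = np.ones(m.shape[1])

# sum(axis=1) and a matrix-vector product with ones give the same result
assert np.allclose(np.asarray(m.sum(axis=1)).ravel(), m @ ones)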
Upvotes: 1