Reputation: 289
I have a matrix A in CSC format, from which I index a single column:
b = A[:,col]
resulting in a (n x 1) matrix. What I want to do is:
v = M * b
where M is a (n x n) matrix in CSR. The result v is a (n x 1) CSR matrix. I need to iterate over the values in v (skipping the 0s) and retrieve the index of one element meeting a particular criterion (note: the sparse matrix formats were not chosen to suit that particular operation, but general matrix-matrix products should be fastest with CSR * CSC, right?)
The problem is that iterating over the entries of the CSR-formatted vector (v[i, 0] for 0 <= i < n) is terribly slow, and I also waste quite some memory since v is not sparse anymore.
Could anyone tell me how to perform these operations as such, that I can quickly iterate over the result vector, keeping the copy-related memory overhead small?
IN: M (CSR matrix), A (CSC matrix), col_index
v = M * A[:, col_index]
for entry in v:
    do stuff
Is it also possible to somehow speed up "advanced" (fancy) indexing over the columns of a CSC matrix? At some other point in the code I have to extract a submatrix of A that contains a given subset of all columns (this cannot be reformulated to allow slicing, so I use an index array). A[:, idxlist] takes quite long according to line profiling.
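A toy version of both operations, just to make the question concrete (shapes, densities and the threshold criterion are placeholders standing in for the real data):
import numpy as np
import scipy.sparse as sps

n = 1000
M = sps.rand(n, n, density=0.01, format='csr')     # stand-in for the real M
A = sps.rand(n, 50, density=0.05, format='csc')    # stand-in for the real A
col_index = 3

v = M * A[:, col_index]            # (n x 1) CSR result

# the slow part: element-wise iteration over the result vector
found = None
for i in range(n):
    if v[i, 0] > 0.5:              # placeholder criterion
        found = i
        break

# the other slow part: extracting a column subset with an index array
idxlist = np.array([0, 4, 7, 23])  # example indices
A_sub = A[:, idxlist]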
Looking forward to your suggestions
Upvotes: 1
Views: 1889
Reputation: 231335
In a Code Review problem I am exploring ways of speeding up iteration over the rows of a sparse matrix: https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566#33566
csr getrow is surprisingly slow. At least for that small test case it is faster to convert the sparse matrix to a dense array and use regular numpy indexing (use np.nonzero to get the sparse entries). It is equally fast to convert the matrix to lil, and do regular Python iteration over zip(X.data, X.rows).
My impression is that scipy.sparse is best for linear algebra problems, and slow for indexing and iteration.
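A minimal sketch of both variants (v here is just a random column vector standing in for the actual product result):
import numpy as np
import scipy.sparse as sps

v = sps.rand(10, 1, density=0.3, format='csr')   # stand-in for M * A[:, col]

# variant 1: densify and use np.nonzero
dense = v.toarray().ravel()
for i in np.nonzero(dense)[0]:
    val = dense[i]                 # do stuff with (i, val)

# variant 2: convert to lil and iterate over zip(X.data, X.rows)
vl = v.tolil()
for i, (vals, cols) in enumerate(zip(vl.data, vl.rows)):
    if vals:                       # this row has a stored entry
        val = vals[0]              # a column vector has at most one value per row
        # do stuff with (i, val)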
Upvotes: 1
Reputation: 67417
The scipy sparse module is getting better with every release, but it is quite obviously a work in progress, so there is a lot of optimization you can do by accessing the innards of the objects directly. E.g. in your case:
>>> a = sps.rand(5, 20, density=0.2, format='csr')
>>> b = sps.rand(20, 1, density=0.2, format='csc')
>>> c = a * b
>>> c.A
array([[ 0.30331594],
       [ 0.        ],
       [ 0.12198742],
       [ 0.34350077],
       [ 0.        ]])
You can get the non-zero entries of c as c.data:
>>> c.data
array([ 0.30331594, 0.12198742, 0.34350077])
Getting the corresponding row number is a little trickier. Probably the easiest is to convert your output to CSC format, since then you have them directly as c.indices, and c.data will still be the same as before:
>>> c.tocsc().indices
array([0, 2, 3])
>>> c.tocsc().data
array([ 0.30331594, 0.12198742, 0.34350077])
But you can extract them without doing the conversion if you don't fancy it (a row has a stored entry exactly where its indptr entry increases):
>>> np.where(c.indptr[:-1] != c.indptr[1:])[0]
array([0, 2, 3], dtype=int64)
So if you wanted to find, e.g. the largest value and its row number, you could do:
>>> row_idx = np.where(c.indptr[:-1] != c.indptr[1:])[0]
>>> idx = np.argmax(c.data)
>>> c.data[idx], row_idx[idx]
(0.34350077450601624, 3)
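And if the criterion is something other than the maximum, the same two arrays can be scanned directly, e.g. with an arbitrary threshold of 0.2:
>>> hits = np.flatnonzero(c.data > 0.2)      # positions in c.data that pass the test
>>> c.data[hits[0]], row_idx[hits[0]]        # first match: ~0.3033 in row 0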
Upvotes: 1