Reputation: 668
I process rather large matrices in Python/Scipy. I need to extract rows from large matrix (which is loaded to coo_matrix) and use them as diagonal elements. Currently I do that in the following fashion:
import numpy as np
from scipy import sparse
def computation(A):
for i in range(A.shape[0]):
diag_elems = np.array(A[i,:].todense())
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1], format = "csc")
#...
#create some random matrix
A = (sparse.rand(1000,100000,0.02,format="csc")*5).astype(np.ubyte)
#get timings
profile.run('computation(A)')
What I see from the profile
output is that most of the time is consumed by get_csr_submatrix
function while extracting diag_elems
. That makes me think that I use either inefficient sparse representation of initial data or wrong way of extracting row from a sparse matrix. Can you suggest a better way to extract a row from a sparse matrix and represent it in a diagonal form?
EDIT
The following variant removes bottleneck from the row extraction (notice that simple changing 'csc'
to csr
is not sufficient, A[i,:]
must be replaced with A.getrow(i)
as well). However the main question is how to omit the materialization (.todense()
) and create the diagonal matrix from the sparse representation of the row.
import numpy as np
from scipy import sparse
def computation(A):
for i in range(A.shape[0]):
diag_elems = np.array(A.getrow(i).todense())
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1], format = "csc")
#...
#create some random matrix
A = (sparse.rand(1000,100000,0.02,format="csr")*5).astype(np.ubyte)
#get timings
profile.run('computation(A)')
If I create DIAgonal matrix from 1-row CSR matrix directly, as follows:
diag_elems = A.getrow(i)
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1])
then I can neither specify format="csc"
argument, nor convert ith_diags
to CSC format:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/profile.py", line 70, in run
prof = prof.run(statement)
File "/usr/local/lib/python2.6/profile.py", line 456, in run
return self.runctx(cmd, dict, dict)
File "/usr/local/lib/python2.6/profile.py", line 462, in runctx
exec cmd in globals, locals
File "<string>", line 1, in <module>
File "<stdin>", line 4, in computation
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/construct.py", line 56, in spdiags
return dia_matrix((data, diags), shape=(m,n)).asformat(format)
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/base.py", line 211, in asformat
return getattr(self,'to' + format)()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/dia.py", line 173, in tocsc
return self.tocoo().tocsc()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/coo.py", line 263, in tocsc
data = np.empty(self.nnz, dtype=upcast(self.dtype))
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/sputils.py", line 47, in upcast
raise TypeError,'no supported conversion for types: %s' % args
TypeError: no supported conversion for types: object`
Upvotes: 5
Views: 2679
Reputation: 782
Here's what I came up with:
def computation(A):
for i in range(A.shape[0]):
idx_begin = A.indptr[i]
idx_end = A.indptr[i+1]
row_nnz = idx_end - idx_begin
diag_elems = A.data[idx_begin:idx_end]
diag_indices = A.indices[idx_begin:idx_end]
ith_diag = sparse.csc_matrix((diag_elems, (diag_indices, diag_indices)),shape=(A.shape[1], A.shape[1]))
ith_diag.eliminate_zeros()
Python profiler said 1.464 seconds versus 5.574 seconds before. It takes advantage of the underlying dense arrays (indptr, indices, data) that define sparse matrices. Here's my crash course: A.indptr[i]:A.indptr[i+1] defines which elements in the dense arrays correspond to the non-zero values in row i. A.data is a dense 1d array of non-zero the values of A and A.indptr is the column where those values go.
I would do some more testing to make very certain this does the same thing as before. I only checked a few cases.
Upvotes: 3