Reputation: 9450
Scipy sparse matrices are typically manipulated through their built-in methods, but sometimes you need to read the raw matrix data, e.g. to assign it to a non-sparse data type. For the sake of demonstration I created a random LIL sparse matrix and converted it to a Numpy array (pure Python data types would have made more sense!) using several methods.
from __future__ import print_function
from scipy.sparse import rand, csr_matrix, lil_matrix
import numpy as np
dim = 1000
lil = rand(dim, dim, density=0.01, format='lil', dtype=np.float32, random_state=0)
print('number of nonzero elements:', lil.nnz)
arr = np.zeros(shape=(dim,dim), dtype=float)
number of nonzero elements: 10000
%%timeit -n3
for i in xrange(dim):
    for j in xrange(dim):
        arr[i,j] = lil[i,j]
3 loops, best of 3: 6.42 s per loop
nonzero() method
%%timeit -n3
nnz = lil.nonzero() # indices of nonzero values
for i, j in zip(nnz[0], nnz[1]):
    arr[i,j] = lil[i,j]
3 loops, best of 3: 75.8 ms per loop
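As a side note, what nonzero() returns can be inspected on a tiny matrix (a sketch with made-up values):

```python
from scipy.sparse import lil_matrix

m = lil_matrix((3, 3))
m[0, 1] = 2.0
m[2, 0] = 5.0
rows, cols = m.nonzero()       # parallel arrays of row and column indices
print(rows.tolist(), cols.tolist())  # [0, 2] [1, 0]
```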
Converting with toarray() is not a general way of reading the matrix data, so it does not count as a solution, but it serves as a speed baseline.
%timeit -n3 arr = lil.toarray()
3 loops, best of 3: 7.85 ms per loop
Reading Scipy sparse matrices with these methods is not efficient at all. Is there any faster way to read these matrices?
Upvotes: 4
Views: 2946
Reputation: 9450
Try reading the raw data. Each Scipy sparse format stores its contents in a different set of underlying Numpy ndarrays.
%%timeit -n3
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i,j] = val
3 loops, best of 3: 4.61 ms per loop
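To see what those raw attributes hold, a tiny example (with made-up values) helps. In a LIL matrix, rows is an object array of per-row lists of column indices, and data holds the matching values:

```python
from scipy.sparse import lil_matrix
import numpy as np

m = lil_matrix((3, 3), dtype=np.float32)
m[0, 1] = 2.0
m[2, 0] = 5.0
m[2, 2] = 7.0
# rows: one sorted list of column indices per row
print(m.rows.tolist())  # [[1], [], [0, 2]]
# data: the values, in the same per-row order as rows
print(m.data.tolist())
```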
For a CSR matrix, reading from the raw data (indptr, indices, data) is a bit less pythonic, but it is worth the speed.
csr = lil.tocsr()
%%timeit -n3
start = 0
for i, end in enumerate(csr.indptr[1:]):
    for j, val in zip(csr.indices[start:end], csr.data[start:end]):
        arr[i,j] = val
    start = end
3 loops, best of 3: 8.14 ms per loop
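The CSR layout is easiest to see on a tiny matrix (a sketch with made-up values): row i's column indices and values live in the half-open slice indptr[i]:indptr[i+1] of indices and data.

```python
from scipy.sparse import csr_matrix
import numpy as np

dense = np.array([[0, 2, 0],
                  [0, 0, 0],
                  [5, 0, 7]], dtype=np.float32)
m = csr_matrix(dense)
# indptr has one entry per row plus one; consecutive pairs delimit each row
print(m.indptr.tolist())   # [0, 1, 1, 3]
print(m.indices.tolist())  # [1, 0, 2]
print(m.data.tolist())     # [2.0, 5.0, 7.0]
```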
A similar approach is used in this DBSCAN implementation.
coo = lil.tocoo()
%%timeit -n3
for i, j, d in zip(coo.row, coo.col, coo.data):
    arr[i,j] = d
3 loops, best of 3: 5.97 ms per loop
Based on these limited tests, reading the raw data directly is the fastest of these methods.
Edit: following @hpaulj's answer, I added the COO matrix method to have all the approaches in one place.
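A quick sanity check (a sketch on a smaller matrix than the benchmarks above) confirms that the raw-data reads reproduce toarray() exactly:

```python
from scipy.sparse import rand
import numpy as np

lil = rand(50, 50, density=0.1, format='lil', dtype=np.float32, random_state=0)

# LIL raw read
arr_lil = np.zeros(lil.shape, dtype=np.float32)
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr_lil[i, j] = val

# COO raw read
coo = lil.tocoo()
arr_coo = np.zeros(lil.shape, dtype=np.float32)
for i, j, d in zip(coo.row, coo.col, coo.data):
    arr_coo[i, j] = d

assert np.array_equal(arr_lil, lil.toarray())
assert np.array_equal(arr_coo, lil.toarray())
```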
Upvotes: 2
Reputation: 231335
A similar question, but dealing with setting sparse values rather than just reading them:
Efficient incremental sparse matrix in python / scipy / numpy
More on accessing values using the underlying representation
Efficiently select random non-zero column from each row of sparse matrix in scipy
Also
why is row indexing of scipy csr matrices slower compared to numpy arrays
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Take a look at what M.nonzero does:
A = self.tocoo()
nz_mask = A.data != 0
return (A.row[nz_mask],A.col[nz_mask])
It converts the matrix to coo format and returns the .row and .col attributes - after filtering out any 'stray' 0s in the .data attribute.
So you could skip the middle man and use those attributes directly:
A = lil.tocoo()
for i, j, d in zip(A.row, A.col, A.data):
    arr[i, j] = d
This is almost as good as toarray():
In [595]: %%timeit
.....: aa = M.tocoo()
.....: for i,j,d in zip(aa.row,aa.col,aa.data):
.....: A[i,j]=d
.....:
100 loops, best of 3: 14.3 ms per loop
In [596]: timeit arr=M.toarray()
100 loops, best of 3: 12.3 ms per loop
But if your target is really an array, you don't need to iterate at all:
In [603]: %%timeit
.....: A=np.empty(M.shape,M.dtype)
.....: aa=M.tocoo()
.....: A[aa.row,aa.col]=aa.data
.....:
100 loops, best of 3: 8.22 ms per loop
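A self-contained version of that vectorized assignment (a sketch; the names are mine, and np.zeros is used instead of np.empty so the entries outside the sparsity pattern come out as 0 rather than garbage):

```python
from scipy.sparse import rand
import numpy as np

M = rand(100, 100, density=0.05, format='csr', dtype=np.float64, random_state=1)
A = np.zeros(M.shape, M.dtype)   # zeros, so unset positions are 0, not uninitialized
coo = M.tocoo()
A[coo.row, coo.col] = coo.data   # one fancy-indexed assignment, no Python loop
assert np.array_equal(A, M.toarray())
```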
My times for @Thoran's 2 methods are:
100 loops, best of 3: 5.81 ms per loop
100 loops, best of 3: 17.9 ms per loop
Same ballpark of times.
Upvotes: 2