Reputation: 9450
Scipy sparse matrices are typically manipulated through their built-in methods, but sometimes you need to read the raw matrix data, e.g. to assign it to a non-sparse data type. For the sake of demonstration I created a random LIL sparse matrix and converted it to a Numpy array (pure Python data types would have made more sense!) using several methods.
from __future__ import print_function
from scipy.sparse import rand, csr_matrix, lil_matrix
import numpy as np
dim = 1000
lil = rand(dim, dim, density=0.01, format='lil', dtype=np.float32, random_state=0)
print('number of nonzero elements:', lil.nnz)
arr = np.zeros(shape=(dim,dim), dtype=float)
number of nonzero elements: 10000
%%timeit -n3
for i in xrange(dim):
    for j in xrange(dim):
        arr[i,j] = lil[i,j]
3 loops, best of 3: 6.42 s per loop
nonzero() method
%%timeit -n3
nnz = lil.nonzero() # indices of nonzero values
for i, j in zip(nnz[0], nnz[1]):
    arr[i,j] = lil[i,j]
3 loops, best of 3: 75.8 ms per loop
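As a side note, what nonzero() returns can be inspected on a tiny matrix (a sketch with made-up values):

```python
from scipy.sparse import lil_matrix

m = lil_matrix((3, 3))
m[0, 1] = 2.0
m[2, 0] = 5.0
rows, cols = m.nonzero()       # parallel arrays of row and column indices
print(rows.tolist(), cols.tolist())  # [0, 2] [1, 0]
```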
Converting with toarray() is not a general way of reading the matrix data, so it does not count as a solution, but it serves as a speed baseline.
%timeit -n3 arr = lil.toarray()
3 loops, best of 3: 7.85 ms per loop
Reading Scipy sparse matrices with these methods is not efficient at all. Is there any faster way to read these matrices?
Upvotes: 4
Views: 2946
Reputation: 9450
Try reading the raw data. Each Scipy sparse format stores its contents in a different set of underlying Numpy ndarrays.
%%timeit -n3
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i,j] = val
3 loops, best of 3: 4.61 ms per loop
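To see what those raw attributes hold, a tiny example (with made-up values) helps. In a LIL matrix, rows is an object array of per-row lists of column indices, and data holds the matching values:

```python
from scipy.sparse import lil_matrix
import numpy as np

m = lil_matrix((3, 3), dtype=np.float32)
m[0, 1] = 2.0
m[2, 0] = 5.0
m[2, 2] = 7.0
# rows: one sorted list of column indices per row
print(m.rows.tolist())  # [[1], [], [0, 2]]
# data: the values, in the same per-row order as rows
print(m.data.tolist())
```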
For a CSR matrix, reading from the raw data (indptr, indices, data) is a bit less pythonic, but it is worth the speed.
csr = lil.tocsr()
%%timeit -n3
start = 0
for i, end in enumerate(csr.indptr[1:]):
    for j, val in zip(csr.indices[start:end], csr.data[start:end]):
        arr[i,j] = val
    start = end
3 loops, best of 3: 8.14 ms per loop
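The CSR layout is easiest to see on a tiny matrix (a sketch with made-up values): row i's column indices and values live in the half-open slice indptr[i]:indptr[i+1] of indices and data.

```python
from scipy.sparse import csr_matrix
import numpy as np

dense = np.array([[0, 2, 0],
                  [0, 0, 0],
                  [5, 0, 7]], dtype=np.float32)
m = csr_matrix(dense)
# indptr has one entry per row plus one; consecutive pairs delimit each row
print(m.indptr.tolist())   # [0, 1, 1, 3]
print(m.indices.tolist())  # [1, 0, 2]
print(m.data.tolist())     # [2.0, 5.0, 7.0]
```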
A similar approach is used in this DBSCAN implementation.
coo = lil.tocoo()
%%timeit -n3
for i, j, d in zip(coo.row, coo.col, coo.data):
    arr[i,j] = d
3 loops, best of 3: 5.97 ms per loop
Based on these limited tests, reading the raw data directly is the fastest of these methods.
Edit: following @hpaulj's answer, I added the COO matrix method to have all the approaches in one place.
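A quick sanity check (a sketch on a smaller matrix than the benchmarks above) confirms that the raw-data reads reproduce toarray() exactly:

```python
from scipy.sparse import rand
import numpy as np

lil = rand(50, 50, density=0.1, format='lil', dtype=np.float32, random_state=0)

# LIL raw read
arr_lil = np.zeros(lil.shape, dtype=np.float32)
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr_lil[i, j] = val

# COO raw read
coo = lil.tocoo()
arr_coo = np.zeros(lil.shape, dtype=np.float32)
for i, j, d in zip(coo.row, coo.col, coo.data):
    arr_coo[i, j] = d

assert np.array_equal(arr_lil, lil.toarray())
assert np.array_equal(arr_coo, lil.toarray())
```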
Upvotes: 2
Reputation: 231335
A similar question, but dealing with setting sparse values rather than just reading them:
Efficient incremental sparse matrix in python / scipy / numpy
More on accessing values using the underlying representation
Efficiently select random non-zero column from each row of sparse matrix in scipy
Also
why is row indexing of scipy csr matrices slower compared to numpy arrays
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Take a look at what M.nonzero does:
A = self.tocoo()
nz_mask = A.data != 0
return (A.row[nz_mask],A.col[nz_mask])
It converts the matrix to coo format and returns the .row and .col attributes - after filtering out any 'stray' 0s in the .data attribute.
So you could skip the middle man and use those attributes directly:
A = lil.tocoo()
for i, j, d in zip(A.row, A.col, A.data):
    arr[i, j] = d
This is almost as good as toarray():
In [595]: %%timeit
.....: aa = M.tocoo()
.....: for i,j,d in zip(aa.row,aa.col,aa.data):
.....: A[i,j]=d
.....:
100 loops, best of 3: 14.3 ms per loop
In [596]: timeit arr=M.toarray()
100 loops, best of 3: 12.3 ms per loop
But if your target is really an array, you don't need to iterate at all:
In [603]: %%timeit
.....: A=np.empty(M.shape,M.dtype)
.....: aa=M.tocoo()
.....: A[aa.row,aa.col]=aa.data
.....:
100 loops, best of 3: 8.22 ms per loop
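A self-contained version of that vectorized assignment (a sketch; the names are mine, and np.zeros is used instead of np.empty so the entries outside the sparsity pattern come out as 0 rather than garbage):

```python
from scipy.sparse import rand
import numpy as np

M = rand(100, 100, density=0.05, format='csr', dtype=np.float64, random_state=1)
A = np.zeros(M.shape, M.dtype)   # zeros, so unset positions are 0, not uninitialized
coo = M.tocoo()
A[coo.row, coo.col] = coo.data   # one fancy-indexed assignment, no Python loop
assert np.array_equal(A, M.toarray())
```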
My times for @Thoran's 2 methods are:
100 loops, best of 3: 5.81 ms per loop
100 loops, best of 3: 17.9 ms per loop
Same ballpark of times.
Upvotes: 2