Reputation: 5927
I have a data file that encodes information about nonzero elements of a large sparse boolean matrix. This matrix does not have any particular structure, i.e. it's not diagonal or block etc. Each row of the file determines one element. Right now I use the following loop to populate the matrix:
import numpy as np
from scipy.sparse import dok_matrix

nRows = 30000
nCols = 600000
data = dok_matrix((nRows, nCols), dtype=np.int8)
with open('input.txt', 'r') as fraw:
    for line in fraw:
        ## Figure out iRow and iCol to set to 1 from line
        data[iRow, iCol] = 1
This works, but it is very slow. Is there a different type of scipy.sparse matrix that is better suited?
'Optimal' here means both the speed of building the matrix and the speed of accessing row and column blocks of it, e.g. slicing operations like
someRows = data[rowIndex1:rowIndex2, :]
someColumns = data[:, colIndex1:colIndex2]
Does the answer change if memory is more important than speed?
Thx
Upvotes: 0
Views: 372
Reputation: 231385
For incremental additions like this, dok is as good as it gets. It is really a dictionary that stores each value under a tuple key, (iRow, iCol), so storing and fetching depend on basic Python dictionary efficiency. The only other format that handles incremental additions well is lil, which stores the data as two lists of lists.
Another approach is to collect your data in three lists and construct the matrix at the end. A starting point for that is coo and its (data, (i, j)) input format.
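A minimal sketch of that coo route, assuming a hypothetical parse_line helper (not in the question) that extracts (iRow, iCol) from a line of input.txt:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
with open('input.txt', 'r') as fraw:
    for line in fraw:
        iRow, iCol = parse_line(line)   # assumed parsing helper
        rows.append(iRow)
        cols.append(iCol)
        vals.append(1)

data = coo_matrix((vals, (rows, cols)), shape=(30000, 600000), dtype=np.int8)
# convert once for fast slicing: tocsr() for row blocks, tocsc() for column blocks
data = data.tocsr()

Note that if the file can list the same element more than once, coo sums the duplicates during conversion, so you may want to reset the stored values to 1 afterwards.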
Dense numpy arrays are loaded from a file with genfromtxt or loadtxt. Both of those read the file line by line, collecting values in a list of lists, with array creation at the end.
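A sketch of the same idea with loadtxt, under the assumption (not stated in the question) that each line of input.txt is simply two whitespace-separated integers, iRow iCol:

import numpy as np
from scipy.sparse import coo_matrix

# assumes two whitespace-separated integer columns per line: iRow iCol
ij = np.loadtxt('input.txt', dtype=np.int64, ndmin=2)   # ndmin=2 keeps shape (N, 2) even for one line
vals = np.ones(ij.shape[0], dtype=np.int8)
data = coo_matrix((vals, (ij[:, 0], ij[:, 1])), shape=(30000, 600000))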
What's the speed like if you just read the file and parse the values, without saving anything to the dok? That would give you an idea of how much time is actually spent adding the data to the matrix.
Another possibility is to store the values directly in a generic dictionary, and use that to create the dok.
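Applied to the loop in the question, that would look roughly like this (again with the assumed parse_line helper). Newer SciPy releases may disallow dok_matrix.update, in which case building a coo_matrix from the dictionary keys is the safer route:

import numpy as np
from scipy.sparse import dok_matrix

adict = {}
with open('input.txt', 'r') as fraw:
    for line in fraw:
        iRow, iCol = parse_line(line)   # assumed parsing helper, as above
        adict[(iRow, iCol)] = 1         # plain dict insert, no sparse overhead

data = dok_matrix((30000, 600000), dtype=np.int8)
data.update(adict)   # bulk transfer; may be blocked on newer SciPy versions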
In [60]: adict=dict()

In [61]: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....:

In [62]: dd=sparse.dok_matrix((1000,1000),dtype=np.int8)

In [63]: dd.update(adict)
In [64]: dd.A
Out[64]:
array([[1, 0, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 1, 0, 0],
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 0, 1]], dtype=int8)
That is quite a bit faster than directly updating the dok.
In [66]: %%timeit
   ....: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....: dd.update(adict)
   ....:
1000 loops, best of 3: 1.32 ms per loop

In [67]: %%timeit
   ....: for i in np.random.randint(1000,size=(2000,)):
   ....:     dd[i,i]=1
   ....:
10 loops, best of 3: 35.6 ms per loop
There must be some overhead in updating the dok that I wasn't taking into account.
I just realized that I'd suggested this update method once before:
https://stackoverflow.com/a/27771335/901925 (Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?)
Upvotes: 2