Reputation: 5927
I have a data file that encodes information about nonzero elements of a large sparse boolean matrix. This matrix does not have any particular structure, i.e. it's not diagonal or block etc. Each row of the file determines one element. Right now I use the following loop to populate the matrix:
import numpy as np
from scipy.sparse import dok_matrix

nRows = 30000
nCols = 600000
data = dok_matrix((nRows, nCols), dtype=np.int8)
with open('input.txt', 'r') as fraw:
    for line in fraw:
        ## Figure out iRow and iCol to set to 1 from line
        data[iRow, iCol] = 1
This works, but it is very slow. Is there a different type of scipy.sparse matrix that is better suited?
'Optimal' here means both the speed of building the matrix and the speed of accessing row and column blocks of it, e.g. slicing operations like
someRows = data[rowIndex1:rowIndex2, :]
someColumns = data[:, colIndex1:colIndex2]
Does the answer change if memory is more important than speed?
Thx
Upvotes: 0
Views: 372
Reputation: 231385
For incremental additions like this, dok is as good as it gets. It is really a dictionary that stores each value under a tuple key, (iRow, iCol), so storing and fetching depend on basic Python dictionary efficiency. The only other format that handles incremental additions well is lil, which stores the data as two lists of lists.
Another approach is to collect your data in three lists and construct the matrix at the end. A starting point for that is coo and its (data, (i, j)) input format.
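A minimal sketch of that coo route, assuming a hypothetical parse_line helper (not in the question) that extracts (iRow, iCol) from a line of input.txt:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
with open('input.txt', 'r') as fraw:
    for line in fraw:
        iRow, iCol = parse_line(line)   # assumed parsing helper
        rows.append(iRow)
        cols.append(iCol)
        vals.append(1)

data = coo_matrix((vals, (rows, cols)), shape=(30000, 600000), dtype=np.int8)
# convert once for fast slicing: tocsr() for row blocks, tocsc() for column blocks
data = data.tocsr()

Note that if the file can list the same element more than once, coo sums the duplicates during conversion, so you may want to reset the stored values to 1 afterwards.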
Dense numpy arrays are loaded from a file with genfromtxt or loadtxt. Both of those read the file line by line, collecting values in a list of lists, with array creation at the end.
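A sketch of the same idea with loadtxt, under the assumption (not stated in the question) that each line of input.txt is simply two whitespace-separated integers, iRow iCol:

import numpy as np
from scipy.sparse import coo_matrix

# assumes two whitespace-separated integer columns per line: iRow iCol
ij = np.loadtxt('input.txt', dtype=np.int64, ndmin=2)   # ndmin=2 keeps shape (N, 2) even for one line
vals = np.ones(ij.shape[0], dtype=np.int8)
data = coo_matrix((vals, (ij[:, 0], ij[:, 1])), shape=(30000, 600000))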
What's the speed like if you just read the file and parse the values, without saving anything to the dok? That would give you an idea of how much time is actually spent adding the data to the matrix.
Another possibility is to store the values directly in a generic dictionary, and use that to create the dok.
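Applied to the loop in the question, that would look roughly like this (again with the assumed parse_line helper). Newer SciPy releases may disallow dok_matrix.update, in which case building a coo_matrix from the dictionary keys is the safer route:

import numpy as np
from scipy.sparse import dok_matrix

adict = {}
with open('input.txt', 'r') as fraw:
    for line in fraw:
        iRow, iCol = parse_line(line)   # assumed parsing helper, as above
        adict[(iRow, iCol)] = 1         # plain dict insert, no sparse overhead

data = dok_matrix((30000, 600000), dtype=np.int8)
data.update(adict)   # bulk transfer; may be blocked on newer SciPy versions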
In [60]: adict=dict()

In [61]: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....:

In [62]: dd=sparse.dok_matrix((1000,1000),dtype=np.int8)

In [63]: dd.update(adict)
In [64]: dd.A
Out[64]:
array([[1, 0, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 1, 0, 0],
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 0, 1]], dtype=int8)
That is quite a bit faster than directly updating the dok.
In [66]: %%timeit
   ....: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....: dd.update(adict)
   ....:
1000 loops, best of 3: 1.32 ms per loop

In [67]: %%timeit
   ....: for i in np.random.randint(1000,size=(2000,)):
   ....:     dd[i,i]=1
   ....:
10 loops, best of 3: 35.6 ms per loop
There must be some overhead in updating the dok that I wasn't taking into account.
I just realized that I'd suggested this update method once before:
https://stackoverflow.com/a/27771335/901925 (Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?)
Upvotes: 2