Reputation: 596
If I want to instantiate a large boolean sparse matrix to assign values at certain indices later, what's the best way to initialize it?
For example, if I want to initialize a 20000000 X 7000 logical sparse matrix on MATLAB with a 10000 filled elements (without mentioning the location of non-zero elements), I would use the following syntax:
Matrix=logical(sparse([],[],[],20000000,7000,10000))
I have no speed constraints on assigning the non-zero values later.
On Python, if I initialize it as a CSR matrix, the creation of the matrix is very fast.
Matrix=csr_matrix((20000000, 7000), dtype=bool)
CPU times: user 860 µs, sys: 2.43 ms, total: 3.29 ms
Wall time: 9.72 ms
However, when I cannot efficiently assign values to a CSR_Matrix, the operation is very slow and you see the inbuilt warning.
If I try initializing it as a LIL matrix:
Matrix=lil_matrix((20000000, 7000), dtype=bool)
CPU times: user 12.4 s, sys: 624 ms, total: 13 s
Wall time: 13 s
or converting the csr_matrix to a lil_matrix:
Matrix=csr_matrix((20000000, 7000), dtype=bool)
Matrix=Matrix.tolil()
CPU times: user 26.8 s, sys: 734 ms, total: 27.5 s
Wall time: 27.5 s
The initialization takes significantly more time.
Is there any way to speed up the initialization of the LIL matrix? If not, what sparse matrix format can I use to speed up the assignment of non-zero elements to such matrices?
Upvotes: 2
Views: 797
Reputation: 231385
I used MATLAB sparse quite a bit years ago. Back then you created a sparse matrix with
S = sparse(i,j,v,m,n)
where i,j,v
where matrices identifying all of the nonzero values. That extra nz
parameter that preallocates 'space' for more nonzeros did not exist.
In scipy
, the equivalent is
S = sparse.csc_matrix((v, (i,j)), m, n)
Again, v,i,j
are fully defined arrays. There isn't any nz
preallocation option. In fact given how attributes are stored, I don't see how preallocation would work or be beneficial.
As you found out, trying to define nonzero values with the csc/csr
format is slow and produces the warning. lil/dok
are designed to make iterative addition faster.
csr
creation time depends on the number of initial nonzero values, and only marginally on the shape (the indptr
array size depends on the number of rows). Normally we don't worry about initialization time for a lil
, but with 20000000 rows, I can see why it would take time. It has to make two object dtype arrays, with empty list elements.
Anyway, try avoid incremental definition. Create the i,j,v
arrays from your source, and then build the matrix.
Upvotes: 2
Reputation: 33532
If you need general incremental indexed-access, dok_matrix is probably your best bet.
It's common to use this one for construction (where it can shine in some cases) before converting to something else like csc, csr (which are usually needed for algebraic operations).
Edit: Most of the below stuff focuses on the accumulated time needed for init + filling + whatever is done after that.
In terms of your case: dok_matrix
init should be quite instant.
...
Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.
That being said, it also depends on your workflow and code which was omitted. Given some structure (python-)loopless task-dependent workflows surely can beat generic (python-)looping adding one element at a time. Often, this involves coo_matrix
.
In your case for some workflows: you don't have any init-time at all as you don't create a matrix a-priori, but only collect whatever is needed before creating the matrix in one batch. Not sure how that fits into your computational model (which is a bit strange: init time-restricted; further usage is free)
Upvotes: 2