Reputation: 95
I have around 10,000 sparse matrices, each of size 50,000x5 with 0.0004 density on average. In a loop (10,000 iterations), I compute a NumPy array, convert it to a csr_matrix, and append that to a list. But memory consumption grows as high as if I were appending the NumPy arrays themselves, not the csr_matrices.
How can I reduce memory consumption while keeping these 10K sparse matrices in memory for further computations?
Sample code:
from scipy.sparse import csr_matrix
import numpy as np

sparse_matrices = []
for i in range(10000):
    np_array = get_np_array()  # returns a dense array of shape ~(50000, 5)
    sparse_matrix = csr_matrix(np_array)
    sparse_matrices.append(sparse_matrix)
    print(np_array.nbytes, sparse_matrix.data.nbytes, repr(sparse_matrix))
This outputs something like the following, which makes it clear that I'm appending compressed matrices. Still, memory grows just as it would if I were appending the NumPy arrays.
1987520 520 <49688x5 sparse matrix of type '<type 'numpy.float64'>'
with 65 stored elements in Compressed Sparse Row format>
1987520 512 <49688x5 sparse matrix of type '<type 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
Just realised that if I use coo_matrix instead of csr_matrix, memory consumption is reasonable. With csr_matrix, memory usage is around ~8 GB.
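For what it's worth, here is a rough sketch of where the bytes go per matrix, using scipy.sparse.random as a stand-in for get_np_array() (the shape and nonzero count are taken from the printed output above):

from scipy.sparse import random as sparse_random

# Stand-in for one matrix from the loop: 49688 x 5 with ~65 nonzeros.
m = sparse_random(49688, 5, density=65 / (49688 * 5), format='csr')

# CSR keeps data, indices (a column index per nonzero) and indptr
# (one entry per row, plus one) -- here indptr alone is 49689 integers.
csr_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# COO keeps data, row and col, each with one entry per nonzero.
coo = m.tocoo()
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes

print(csr_bytes, coo_bytes)  # csr_bytes is dominated by indptr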
Upvotes: 1
Views: 950
Reputation: 231335
For the matrix:
<49688x5 sparse matrix of type '<type 'numpy.float64'>'
with 65 stored elements in Compressed Sparse Row format>
in coo format, the key attributes are row, col and data, all with 65 elements. data is float; the others are integer row and column indices.
In csr format the row attribute is replaced with indptr, which has one entry per row, plus one. With this shape, indptr is 49,689 elements long. If it were csc format, indptr would only be 6 elements (one per column, plus one).
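Those lengths are easy to verify; a small sketch, again using scipy.sparse.random as a stand-in for the question's matrices:

from scipy.sparse import random as sparse_random

m = sparse_random(49688, 5, density=0.0004, format='coo')
print(len(m.tocsr().indptr))  # 49689: one entry per row, plus one
print(len(m.tocsc().indptr))  # 6: one entry per column, plus one
print(len(m.row), len(m.col), len(m.data))  # one entry per nonzero each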
csr is usually more compact than coo. But in your case there are many blank rows, so it is much larger. csr is especially compact for a single-row matrix, and not compact at all for a column vector.
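To illustrate that last point, a quick sketch (the density here is arbitrary, chosen just for contrast):

from scipy.sparse import random as sparse_random

def csr_nbytes(m):
    # total bytes across the three CSR attribute arrays
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

row_vec = sparse_random(1, 49688, density=0.001, format='csr')  # 1 x n
col_vec = sparse_random(49688, 1, density=0.001, format='csr')  # n x 1
print(csr_nbytes(row_vec))  # small: indptr has just 2 entries
print(csr_nbytes(col_vec))  # large: indptr has 49689 entries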
Upvotes: 1