Reputation: 13924
I have a scipy.sparse.csc_matrix that I am trying to transform into an array with scipy.sparse.csc_matrix.toarray(). When I use the function on a small dataset it works fine. However, when I use it on a large dataset, the Python interpreter immediately crashes upon calling the function and the window closes without an error message. The matrix I am trying to transform into an array was created with sklearn.feature_extraction.text.CountVectorizer. I am running Python 2.7.3 on Ubuntu 12.04. To complicate matters, when I try to run the script from the terminal in order to capture any error message, the log records no error and in fact stops much earlier in the script (even though the script runs to completion if toarray() is not called).
Upvotes: 3
Views: 4963
Reputation: 1
Just delete the .toarray() call and use the sparse matrix directly as input to the classifier; it works just fine.
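A minimal sketch of this suggestion, assuming a scikit-learn classifier that accepts sparse input (MultinomialNB is used here as an illustrative choice; the random matrix stands in for a CountVectorizer output):

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

# Stand-in for a CountVectorizer feature matrix: sparse, non-negative
X = sparse.random(100, 5000, density=0.01, format='csc', random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=100)

clf = MultinomialNB()
clf.fit(X, y)            # the sparse matrix is passed as-is; no toarray() needed
pred = clf.predict(X)
```

Because the classifier consumes the sparse matrix directly, the dense 100 x 5000 array is never materialized.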
Upvotes: -1
Reputation: 40169
You cannot call toarray on a large sparse matrix because it tries to store all the values (including the zeros) explicitly in a contiguous chunk of memory.
Let's take an example: assume you have a sparse matrix A:
>>> A.shape
(10000, 100000)
>>> A.nnz # non zero entries
47231
>>> A.dtype.itemsize
8
The size of the non-zeros data in MB is:
>>> (A.nnz * A.dtype.itemsize) / 1e6
0.377848
You can check that this matches the size of the data array of the sparse matrix data structure:
>>> A.data.nbytes / 1e6
0.377848
Depending on the kind of sparse matrix data structure (CSR, CSC, COO, ...), it also stores the locations of the non-zero entries in various ways. In general this approximately doubles the memory usage, so the total memory used by A is on the order of 700 kB.
Converting to the contiguous array representation would materialize all the zeros in memory and the resulting size would be:
>>> A.shape[0] * A.shape[1] * A.dtype.itemsize / 1e6
8000.0
That's 8 GB for this example, compared to less than 1 MB for the original sparse representation.
Upvotes: 3