Michael

Reputation: 13924

Error Converting Sparse Matrix to Array with scipy.sparse.csc_matrix.toarray()

I have a scipy.sparse.csc_matrix that I am trying to convert to a dense array with scipy.sparse.csc_matrix.toarray(). This works fine for a small dataset. For a large dataset, however, the Python interpreter crashes as soon as the function is called, and the window closes without an error message. The matrix was created with sklearn.feature_extraction.text.CountVectorizer. I am running Python 2.7.3 on Ubuntu 12.04. To complicate matters, when I run the script from the terminal to capture any error message, the log records no error and in fact stops much earlier in the script (even though the script runs to completion when toarray() is not called).

Upvotes: 3

Views: 4963

Answers (2)

Doni

Reputation: 1

Just drop the .toarray() call and pass the sparse matrix directly to the classifier; it works fine.
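For illustration, here is a minimal sketch of that approach. The toy documents, the labels and the choice of MultinomialNB are my own assumptions, not something taken from the question; the point is only that the output of CountVectorizer can be handed to a scikit-learn estimator without ever being densified.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply today",
    "markets rallied after the report",
]
labels = [0, 0, 1, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # a scipy.sparse matrix, not a dense array

# Fit directly on the sparse matrix; no toarray() call anywhere.
clf = MultinomialNB()
clf.fit(X, labels)

# New text is vectorized with the same vocabulary and also stays sparse.
X_new = vectorizer.transform(["the cat chased the dog"])
pred = clf.predict(X_new)
```

Whether a given estimator accepts sparse input depends on the estimator, so it is worth checking its documentation before removing the conversion.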

Upvotes: -1

ogrisel

Reputation: 40169

You cannot call toarray on a large sparse matrix because it will try to store all the values (including the zeros) explicitly in a contiguous chunk of memory.

Let's take an example. Assume you have a sparse matrix A:

>>> A.shape
(10000, 100000)
>>> A.nnz              # non zero entries
47231
>>> A.dtype.itemsize
8

The size of the non-zero data in MB is:

>>> (A.nnz * A.dtype.itemsize) / 1e6
0.377848

You can check that this matches the size of the data array of the sparse matrix data-structure:

>>> A.data.nbytes / 1e6
0.377848

Depending on the kind of sparse matrix data structure (CSR, CSC, COO...), the locations of the non-zero entries are also stored in various ways. In general this approximately doubles the memory usage, so the total memory used by A is on the order of 700 kB.
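If you would rather measure the overhead than estimate it, the sketch below adds up the three arrays that back a CSC matrix. The matrix is a synthetic stand-in that I build with the same shape and number of non-zeros as A (the placement of the entries is arbitrary), and sparse_nbytes is a hypothetical helper name, not a SciPy function.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-in for A: same shape and nnz as in the example above.
# Entry positions are arbitrary but distinct (one per column index k).
n_rows, n_cols, nnz = 10000, 100000, 47231
k = np.arange(nnz)
rows = k % n_rows
cols = k
data = np.ones(nnz, dtype=np.float64)
A = sparse.csc_matrix((data, (rows, cols)), shape=(n_rows, n_cols))


def sparse_nbytes(m):
    """Bytes held by the data, indices and indptr arrays of a CSR/CSC matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


data_mb = A.data.nbytes / 1e6
total_mb = sparse_nbytes(A) / 1e6
print("values only: %.3f MB, values + index arrays: %.3f MB" % (data_mb, total_mb))
```

The exact total depends on the format and on the integer type SciPy picks for the index arrays. For CSC, indptr has one entry per column plus one, so a very wide matrix pays more for its index arrays than the rough doubling suggests.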

Converting to the contiguous array representation would materialize all the zeros in memory and the resulting size would be:

>>> A.shape[0] * A.shape[1] * A.dtype.itemsize / 1e6
8000.0

That's 8GB for this example, compared to less than 1MB for the original sparse representation.
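One way to avoid the crash is to compute that size before converting and refuse when it is too large. The following is only a sketch: dense_nbytes and safe_toarray are hypothetical helper names, and the 1 GB default limit is an arbitrary assumption that you should adapt to the RAM you actually have.

```python
import numpy as np
from scipy import sparse


def dense_nbytes(m):
    """Bytes that m.toarray() would need, zeros included."""
    return int(m.shape[0]) * int(m.shape[1]) * m.dtype.itemsize


def safe_toarray(m, max_bytes=10**9):
    """Densify m only if the dense result fits within max_bytes."""
    needed = dense_nbytes(m)
    if needed > max_bytes:
        raise MemoryError(
            "dense array would need %.1f MB, limit is %.1f MB"
            % (needed / 1e6, max_bytes / 1e6)
        )
    return m.toarray()


# A small matrix converts without trouble.
small = sparse.csc_matrix(np.eye(3))
dense_small = safe_toarray(small)

# An empty matrix with the shape from the example is refused up front,
# before any large allocation is attempted.
big = sparse.csc_matrix((10000, 100000), dtype=np.float64)
try:
    safe_toarray(big)
    refused = False
except MemoryError:
    refused = True
```

Failing early with an explicit exception gives you a message in the log instead of an interpreter that dies silently.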

Upvotes: 3
