glaoye

Reputation: 1

Why is each element in a sparse csc matrix 8 bytes?

For example, if I initially have a dense matrix:

A = numpy.array([[0, 0],[0, 1]])

and then convert it to a CSC sparse matrix using csc_matrix(A), the matrix is stored as:

(1, 1)    1
#(row, column)   val

which consists of three values. Why is the size of the sparse matrix only 8 bytes, even though the computer is essentially storing 3 values? Surely the size of the matrix would be at least 12 bytes, since an integer usually occupies 4 bytes.

Upvotes: 0

Views: 159

Answers (1)

BoarGules

Reputation: 16941

I don't agree that the size of the sparse matrix is 8 bytes. I may be missing something, but if I do this, I get a very different answer:

>>> import sys
>>> import numpy
>>> from scipy import sparse
>>> A = numpy.array([[0, 0],[0, 1]])
>>> s = sparse.csc_matrix(A)
>>> s
<2x2 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Compressed Sparse Column format>
>>> sys.getsizeof(s)
56

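This is the size of the matrix object itself as Python reports it, and I assure you that it is accurate: Python must know how big the object is, because it does the memory allocation. Bear in mind that sys.getsizeof counts only the object it is given, not the memory of other objects it refers to, so the NumPy arrays holding the matrix contents are not part of this figure.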

If, on the other hand, you use s.data.nbytes:

>>> s.data.nbytes       
4

This gives the expected answer of 4. It is expected because s reports itself as having one stored element of type int32. The value returned, according to the docs,

does not include memory consumed by non-element attributes of the array object.

This is not a more accurate result, just an answer to a different question, as 35421869 makes clear.
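To make the distinction concrete, here is a minimal sketch that puts the two measurements side by side. It relies on the `data`, `indices`, and `indptr` attributes that SciPy documents for CSC matrices; the helper `stored_nbytes` is a hypothetical name introduced only for illustration, and the dtype is fixed to int32 so that the result does not depend on the platform's default integer type.

```python
import sys

import numpy as np
from scipy import sparse


def stored_nbytes(m):
    """Bytes held by the three arrays backing a CSC matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


A = np.array([[0, 0], [0, 1]], dtype=np.int32)
s = sparse.csc_matrix(A)

# Shallow size of the Python wrapper object; the arrays it refers to are not counted.
print("sys.getsizeof(s):", sys.getsizeof(s))
# Bytes used by the explicitly stored values alone.
print("s.data.nbytes:   ", s.data.nbytes)
# Stored values plus the row indices and the column pointers.
print("stored_nbytes(s):", stored_nbytes(s))
```

None of the three numbers is "the" size of the matrix; each one answers a different question about it.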

I can't explain why you report a value of 8 bytes when the result 4 is clearly correct. One possibility is that numpy.array([[0, 0],[0, 1]]) is not in fact what was actually converted to the sparse array. Where did the value 5 come from? The value of 8 is consistent with a starting value of numpy.array([[0, 0],[0, 5.0]]), whose single stored element would be a float64 occupying 8 bytes.
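The link between element type and nbytes can be checked directly. The sketch below is an illustration, not a reconstruction of what the asker actually ran: it builds the same 2x2 matrix with three explicitly chosen dtypes and prints the bytes used by the single stored value. The 64-bit integer case is my own assumption about another way a figure of 8 could arise, since the default integer type differs between platforms; `stored_value_bytes` is a hypothetical helper name.

```python
import numpy as np
from scipy import sparse


def stored_value_bytes(dtype):
    """Bytes used by the stored values of a 2x2 matrix with one nonzero entry."""
    dense = np.array([[0, 0], [0, 5]], dtype=dtype)
    return sparse.csc_matrix(dense).data.nbytes


for dtype in (np.int32, np.int64, np.float64):
    # One stored element, so nbytes equals the itemsize of the dtype.
    print(np.dtype(dtype).name, stored_value_bytes(dtype))
```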

Your figure of 12 bytes is based on two unmet expectations.

  1. It is possible to represent a sparse matrix as a list of triples (row, column, value), and that is in fact how a COO matrix is stored, at least in principle. But CSC stands for compressed sparse column: instead of one column index per stored value, it keeps one pointer per column (plus a final one) marking where that column's entries begin, so there are generally fewer explicit column indexes than in a COO matrix. This Wikipedia article provides a lucid explanation of how the storage actually works.
  2. nbytes does not report the total memory cost of storing the elements of the matrix. It reports a numpy invariant (over many different kinds of matrix) x.nbytes == np.prod(x.shape) * x.itemsize. This is an important quantity because the explicitly stored elements of the matrix form its biggest subsidiary data structure and must be allocated in contiguous memory.
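
The difference between the two layouts in point 1 can be inspected on a slightly larger matrix. This is a sketch that assumes the `row`, `col`, and `data` attributes of SciPy's `coo_matrix` and the `data`, `indices`, and `indptr` attributes of `csc_matrix`; the 4x3 matrix is made up for illustration.

```python
import numpy as np
from scipy import sparse

dense = np.array(
    [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 5],
     [0, 6, 0]],
    dtype=np.int32,
)

coo = sparse.coo_matrix(dense)
csc = sparse.csc_matrix(dense)

# COO: one row index and one column index per stored value.
print("COO row: ", coo.row.tolist())
print("COO col: ", coo.col.tolist())
print("COO data:", coo.data.tolist())

# CSC: one row index per stored value, plus one pointer per column and a final end marker.
print("CSC data:   ", csc.data.tolist())
print("CSC indices:", csc.indices.tolist())
print("CSC indptr: ", csc.indptr.tolist())

# The entries of column j live in the slice indptr[j]:indptr[j + 1].
j = 1
start, end = csc.indptr[j], csc.indptr[j + 1]
column_1 = dict(zip(csc.indices[start:end].tolist(), csc.data[start:end].tolist()))
print("column 1 as {row: value}:", column_1)
```

Here COO keeps 12 index entries (6 row plus 6 column) while CSC keeps 10 (6 row indices plus 4 pointers). For a matrix as small as the 2x2 one in the question, the pointer array has 3 entries against a single COO column index, so the compression only pays off once the number of stored values exceeds the number of columns.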

Upvotes: 1
