Reputation: 950
In scipy
, when I multiply a slice of a sparse matrix with an array containing only zeros, the result is a matrix that is less or equally sparse than before, even though it should be more or equally sparse. The same holds for setting parts of the matrix to 0 or False:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix as csr
>>> M = csr(np.random.random((8,8))>0.9)
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 6 stored elements in Compressed Sparse Row format>
>>> M[:,0] = False
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
>>> M[:,0].multiply(np.array([[False] for i in xrange(8)]))
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
This is actually computationally expensive for large matrices, because it iterates over all cells in the slice, not just the nonzero ones.
From a mathematical / logical point of view, when multiplying a sparse matrix or vector, all empty cells are certain to remain empty as 0*x == 0
. The same holds for setting to zero: zero-cells do not need to be explicitely set to zero.
What is the best way to deal with this?
I am using scipy version 0.17.0
Upvotes: 0
Views: 668
Reputation: 231385
To recreate your code (using int
type for a more compact display):
In [16]: M = sparse.csr_matrix(np.random.random((8,8))>.7).astype(int)
In [17]: M
Out[17]:
<8x8 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [18]: M.A
Out[18]:
array([[0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 1, 1, 1, 1, 0, 1]])
In [19]: M.tolil().data # show nonzero values by row
Out[19]:
array([list([1, 1]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
Setting a row (or column) explicitly. Note the efficiency warning:
In [20]: M[0,:] = 0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [21]: M.tolil().data
Out[21]:
array([list([0, 0, 0, 0, 0, 0, 0, 0]), list([1, 1]), list([1]),
list([1, 1]), list([]), list([1, 1]), list([1, 1]),
list([1, 1, 1, 1, 1, 1])], dtype=object)
So yes it has set all values in the row to the specified value. And it doesn't attempt to distinguish between setting 0s as opposed to 1s. You can see the code used in M.__setitem__
and M._set_many
(this is where the efficiency warning is generated).
As @jakevpd shows, you need to explicitly tell it to eliminate excess 0s. It does not attempt to do that during the assignment.
In [22]: M.eliminate_zeros()
In [23]: M.tolil().data
Out[23]:
array([list([]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
In general setting values of a matrix explicitly is discouraged, especially with csr
. It's not even allowed with coo
. lil
is the recommended format if you need to do that.
In [24]: Ml = M.tolil()
In [25]: Ml[1,:] = 0
In [26]: Ml.data
Out[26]:
array([list([]), list([]), list([1]), list([1, 1]), list([]), list([1, 1]),
list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
lil
does take care to eliminate 0s.
Multiplication of a row by an array of 0s does not change the sparsity. Nor does it act in-place. It produces a new matrix:
In [29]: M[1,:].multiply(np.zeros((1,8)))
Out[29]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
In [30]: _.A
Out[30]: array([[ 0., 0., 0., 0., 0., 0., 0., 0.]])
In [31]: M[1,:].A
Out[31]: array([[1, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)
Multiplication with a sparse matrix does eliminate 0s (again, not in-place):
In [32]: M[1,:].multiply(sparse.csr_matrix(np.zeros((1,8))))
Out[32]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
(Notice also the different in format between Out[29]
and Out[32]
.)
As a general rule, multiplication, both element-wise and matrix, is the most efficient operation with csr
matrices, especially if the other
is also sparse. In fact row/column sums are performed with matrix multiplication, as is advanced
indexing.
Upvotes: 1
Reputation: 86328
In working with sparse matrices, changing the sparsity pattern is generally a very expensive operation, and so scipy does not do this silently.
If you want to remove explicitly stored zeros from a sparse matrix, you should use the eliminate_zeros()
method; for example:
>>> M = csr(np.random.random((1000,1000))>0.9, dtype=float)
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M[:, 0] *= 0
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M.eliminate_zeros()
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99657 stored elements in Compressed Sparse Row format>
Scipy could call the eliminate_zeros
routine automatically after doing this kind of operation, but the developers chose to give the user more flexibility and control when doing something as expensive as changing the sparsity structure.
Upvotes: 2