I'm trying to convert this code to use scipy sparse matrices, since the real matrix is very large, but I'm having trouble with it. Can anyone help?
import numpy as np

G = np.array([[0., 50., 50., 0.],
              [10., 0., 10., 0.],
              [0., 0., 0., 10.],
              [2., 0., 2., 0.]])
s = G.sum(axis=0)
m = np.minimum(G, 1).transpose()
sm = s * m
sm_rnorm = sm / sm.sum(axis=0)
smm = sm * sm_rnorm
G += smm.transpose()
print(G)
I tried the following:
import numpy as np
from scipy.sparse import csc_matrix

G = np.array([[0., 50., 50., 0.],
              [10., 0., 10., 0.],
              [0., 0., 0., 10.],
              [2., 0., 2., 0.]])
G = csc_matrix(G, dtype=np.float64)  # np.float is deprecated; use np.float64
s = csc_matrix(G.sum(axis=0))
m = csc_matrix.minimum(G, 1).transpose()
sm = s * m
sm_rnorm = sm / csc_matrix(sm.sum(axis=0))
smm = sm * sm_rnorm
G += smm.transpose()
print(G)
...but get ValueError: dimension mismatch
I ran your dense code,
In [224]: G = np.array([[0., 50., 50., 0.],
...: [10., 0., 10., 0.],
...: [0., 0., 0., 10.],
...: [2., 0., 2., 0.]])
...: s = G.sum(axis=0)
...: m = np.minimum(G, 1).transpose()
...: sm = s * m
...: sm_rnorm = (sm / sm.sum(axis=0))
...: smm = sm * sm_rnorm
...:
In [225]: s
Out[225]: array([12., 50., 62., 10.])
In [226]: m
Out[226]:
array([[0., 1., 0., 1.],
[1., 0., 0., 0.],
[1., 1., 0., 1.],
[0., 0., 1., 0.]])
In [227]: sm
Out[227]:
array([[ 0., 50., 0., 10.],
[12., 0., 0., 0.],
[12., 50., 0., 10.],
[ 0., 0., 62., 0.]])
and then started the sparse version:
In [192]: from scipy import sparse
In [228]: Gm = sparse.csr_matrix(G)
In [229]: Gm
Out[229]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 7 stored elements in Compressed Sparse Row format>
In [230]: s_m = Gm.sum(axis=0)
In [231]: s_m
Out[231]: matrix([[12., 50., 62., 10.]])
In [233]: m_m = Gm.minimum(1).T
In [234]: m_m.A
Out[234]:
array([[0., 1., 0., 1.],
[1., 0., 0., 0.],
[1., 1., 0., 1.],
[0., 0., 1., 0.]])
Oops:
In [236]: s_m * m_m
Out[236]: matrix([[112., 74., 10., 74.]])
* is matrix multiplication for np.matrix and sparse matrices, the same as the dense dot product:
In [237]: s.dot(m)
Out[237]: array([112., 74., 10., 74.])
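This operator difference is easy to check in isolation. A minimal sketch (note this is the behavior of the spmatrix classes like csr_matrix; scipy's newer sparse array classes, e.g. csr_array, treat * as elementwise instead):

```python
import numpy as np
from scipy import sparse

a = np.array([[1., 2.],
              [3., 4.]])
b = np.array([[10., 20.],
              [30., 40.]])

# For ndarrays, * is elementwise:
print(a * b)                      # [[ 10.  40.] [ 90. 160.]]

# For the spmatrix classes, * is matrix multiplication,
# matching np.matrix semantics:
A = sparse.csr_matrix(a)
B = sparse.csr_matrix(b)
print((A * B).toarray())          # same as a.dot(b): [[ 70. 100.] [150. 220.]]

# Elementwise multiplication on sparse matrices needs .multiply():
print(A.multiply(B).toarray())    # same as a * b
```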
Sparse elementwise multiplication uses the multiply method:
In [242]: sm_m = m_m.multiply(s_m)
In [243]: sm_m.A
Out[243]:
array([[ 0., 50., 0., 10.],
[12., 0., 0., 0.],
[12., 50., 0., 10.],
[ 0., 0., 62., 0.]])
Now to match sm_rnorm:
In [244]: sm_m.sum(axis=0)
Out[244]: matrix([[ 24., 100., 62., 20.]])
In [250]: sm_m / sm_m.sum(axis=0)
Out[250]:
matrix([[0. , 0.5, 0. , 0.5],
[0.5, 0. , 0. , 0. ],
[0.5, 0.5, 0. , 0.5],
[0. , 0. , 1. , 0. ]])
sparse/dense division works elementwise, but sparse/sparse has problems:
In [252]: sm_m / sparse.csr_matrix(sm_m.sum(axis=0))
...
--> 576 return self._divide(other, true_divide=True)
...
ValueError: inconsistent shapes
I think this is a matrix division issue, but I'd dig further to be sure.
sm_m.multiply(1 / sm_m.sum(axis=0)) gives a sparse matrix with the right values, but is slower (at least for this example).
smm_m = sm_m.multiply(sm_m / sm_m.sum(axis=0)) matches smm. And Gm += smm_m.T (with the same transpose as the dense code) works. Sparse += doesn't raise an efficiency warning here because it doesn't change the sparsity.
So the key issue is keeping matrix multiplication and element multiplication straight (and the corresponding divisions).
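Putting those pieces together, one sketch of a full sparse version, checked against the dense result (keeping the transpose from the original dense code):

```python
import numpy as np
from scipy import sparse

G = np.array([[0., 50., 50., 0.],
              [10., 0., 10., 0.],
              [0., 0., 0., 10.],
              [2., 0., 2., 0.]])

# Dense reference result:
s = G.sum(axis=0)
m = np.minimum(G, 1).T
sm = s * m
smm = sm * (sm / sm.sum(axis=0))
G_ref = G + smm.T

# Sparse version:
Gm = sparse.csr_matrix(G)
s_m = Gm.sum(axis=0)                   # dense np.matrix, shape (1, 4)
m_m = Gm.minimum(1).T                  # sparse
sm_m = m_m.multiply(s_m)               # elementwise; returns COO
smm_m = sm_m.multiply(sm_m / sm_m.sum(axis=0))  # sparse .multiply(dense)
Gm = Gm + smm_m.T                      # same transpose as the dense code

print(np.allclose(Gm.toarray(), G_ref))   # True
```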
sklearn.utils.sparsefuncs has some sparse utility functions. The sm_m above is in coo format (not sure why):
In [366]: sm_m
Out[366]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 7 stored elements in COOrdinate format>
In [367]: sm_m.A
Out[367]:
array([[ 0., 50., 0., 10.],
[12., 0., 0., 0.],
[12., 50., 0., 10.],
[ 0., 0., 62., 0.]])
Convert it to csr:
In [368]: sm_m1 = sm_m.tocsr()
In [369]: sm_m1
Out[369]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 7 stored elements in Compressed Sparse Row format>
derive the column scaling array:
In [370]: x = sm_m1.sum(axis=0)
In [371]: x
Out[371]: matrix([[ 24., 100., 62., 20.]])
In [372]: x = 1/x.A1 # .A1 makes a 1d array from np.matrix
apply scaling inplace:
In [373]: sklearn.utils.sparsefuncs.inplace_csr_column_scale(sm_m1,x)
In [374]: sm_m1.A
Out[374]:
array([[0. , 0.5, 0. , 0.5],
[0.5, 0. , 0. , 0. ],
[0.5, 0.5, 0. , 0.5],
[0. , 0. , 1. , 0. ]])
That inplace column_scale is simple:
def inplace_csr_column_scale(X, scale):
    # ....
    X.data *= scale.take(X.indices, mode='clip')
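If you'd rather not pull in sklearn, the same trick works directly on the CSR attributes. A sketch, where csr_column_scale is a hypothetical helper mirroring sklearn's:

```python
import numpy as np
from scipy import sparse

def csr_column_scale(X, scale):
    # X.indices holds the column index of each stored value, so
    # indexing scale by it multiplies every nonzero by its column factor.
    X.data *= scale.take(X.indices, mode='clip')

A = sparse.csr_matrix(np.array([[1., 2.],
                                [3., 0.]]))
csr_column_scale(A, np.array([10., 100.]))
print(A.toarray())    # [[ 10. 200.] [ 30.   0.]]
```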
The m_m.multiply(s_m) step can be done this way as well:
In [380]: m1_m = m_m.tocsr()
In [381]: sklearn.utils.sparsefuncs.inplace_csr_column_scale(m1_m,s_m.A1)
In [382]: m1_m.A
Out[382]:
array([[ 0., 50., 0., 10.],
[12., 0., 0., 0.],
[12., 50., 0., 10.],
[ 0., 0., 62., 0.]])
I suspect the code can be cleaned up, removing the transpose etc.
Is G inherently square? I like to use non-square arrays to better keep track of shapes, transposes, and dimensional sums. I tried to expand G to (5, 4), and hit a problem at the s * m step.
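For reference, this is the kind of failure a non-square G exposes at the s * m step. A sketch with a hypothetical (5, 4) G: s has shape (4,) while m has shape (4, 5), and broadcasting aligns trailing axes, so the row scaling needs an explicit axis:

```python
import numpy as np

G = np.arange(20.).reshape(5, 4)   # non-square, to keep shapes honest
s = G.sum(axis=0)                  # shape (4,)
m = np.minimum(G, 1).T             # shape (4, 5)

try:
    sm = s * m                     # (4,) vs (4, 5): trailing dims 4 != 5
except ValueError as e:
    print('broadcast error:', e)

sm = s[:, None] * m                # (4, 1) broadcasts across the columns
print(sm.shape)                    # (4, 5)
```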