Reputation: 633
Basically, I am just trying to do a simple matrix operation: extract each column of a sparse matrix and normalize it by dividing it by its length.
#csc sparse matrix
self.__WeightMatrix__ = self.__WeightMatrix__.tocsc()
#iterate through columns
for Col in xrange(self.__WeightMatrix__.shape[1]):
    Column = self.__WeightMatrix__[:,Col].data
    List = [x**2 for x in Column]
    #get the column length
    Len = math.sqrt(sum(List))
    #here I assumed dot(number,Column) would do a basic scalar product
    dot((1/Len),Column)
    #now what? how do I update the original column of the matrix? everything returned so far is a copy, which drove me nuts and made me miss pointers so much
I've searched through the scipy sparse matrix documentation and found nothing useful. I was hoping for a function that returns a pointer/reference into the matrix so that I can modify its values directly. Thanks
Upvotes: 3
Views: 3641
Reputation: 67417
In CSC format you have two writable attributes, data and indices, which hold the non-zero entries of your matrix and the corresponding row indices. You can use these to your advantage as follows:
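import numpy as np

def sparse_row_normalize(sps_mat):
    # Scale every row to unit Euclidean length, modifying sps_mat in place.
    if sps_mat.format != 'csc':
        msg = 'Can only row-normalize in place with csc format, not {0}.'
        msg = msg.format(sps_mat.format)
        raise ValueError(msg)
    # indices holds the row of each stored entry, so bincount sums squares per row.
    row_norm = np.sqrt(np.bincount(sps_mat.indices, weights=sps_mat.data * sps_mat.data))
    # Divide each stored entry by the norm of the row it belongs to.
    sps_mat.data /= np.take(row_norm, sps_mat.indices)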
To see that it actually works:
>>> mat = scipy.sparse.rand(4, 4, density=0.5, format='csc')
>>> mat.toarray()
array([[ 0. , 0. , 0.58931687, 0.31070526],
[ 0.24024639, 0.02767106, 0.22635696, 0.85971295],
[ 0. , 0. , 0.13613897, 0. ],
[ 0. , 0.13766507, 0. , 0. ]])
>>> mat.toarray() / np.sqrt(np.sum(mat.toarray()**2, axis=1))[:, None]
array([[ 0. , 0. , 0.88458487, 0.46637926],
[ 0.26076366, 0.03003419, 0.24568806, 0.93313324],
[ 0. , 0. , 1. , 0. ],
[ 0. , 1. , 0. , 0. ]])
>>> sparse_row_normalize(mat)
>>> mat.toarray()
array([[ 0. , 0. , 0.88458487, 0.46637926],
[ 0.26076366, 0.03003419, 0.24568806, 0.93313324],
[ 0. , 0. , 1. , 0. ],
[ 0. , 1. , 0. , 0. ]])
And it is also numpy fast, no Python loops spoiling the fun:
In [2]: mat = scipy.sparse.rand(10000, 10000, density=0.005, format='csc')
In [3]: mat
Out[3]:
<10000x10000 sparse matrix of type '<type 'numpy.float64'>'
with 500000 stored elements in Compressed Sparse Column format>
In [4]: %timeit sparse_row_normalize(mat)
100 loops, best of 3: 14.1 ms per loop
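The function above normalizes rows, whereas the question asks about columns. The same in-place trick can be adapted, as a rough sketch: in CSC format the indptr attribute marks where each column starts and ends inside data, so a column label for every stored entry can be rebuilt from it. The name sparse_col_normalize is hypothetical (it is not a SciPy function), and the sketch assumes a float dtype and no column made up solely of explicitly stored zeros, which would produce a division by zero.

```python
import numpy as np
import scipy.sparse


def sparse_col_normalize(sps_mat):
    """Scale each column of a CSC matrix to unit Euclidean length, in place.

    Hypothetical helper, not part of SciPy. Assumes a float dtype and that
    no column consists only of explicitly stored zeros.
    """
    if sps_mat.format != 'csc':
        raise ValueError('Expected csc format, not {0}.'.format(sps_mat.format))
    n_cols = sps_mat.shape[1]
    # data[indptr[j]:indptr[j + 1]] holds column j, so the differences of
    # indptr give the number of stored entries in each column.
    col_of_entry = np.repeat(np.arange(n_cols), np.diff(sps_mat.indptr))
    # Sum of squares per column; minlength keeps trailing empty columns.
    col_norm = np.sqrt(np.bincount(col_of_entry,
                                   weights=sps_mat.data * sps_mat.data,
                                   minlength=n_cols))
    # Empty columns have no stored entries, so their zero norm is never used.
    sps_mat.data /= col_norm[col_of_entry]


# Small demonstration matrix with an empty middle column.
demo = scipy.sparse.csc_matrix(np.array([[3.0, 0.0, 0.0],
                                         [4.0, 0.0, 2.0],
                                         [0.0, 0.0, 0.0]]))
sparse_col_normalize(demo)
print(demo.toarray())
```

After the call, every non-empty column of demo should have unit length, and the matrix object itself has been modified rather than copied, which is what the question was after.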
Upvotes: 6