Cong Hui

Reputation: 633

Modify scipy sparse matrix in place

Basically, I am just trying to do a simple matrix operation: extract each column of the matrix and normalize it by dividing it by its length.

    #csc sparse matrix
    self.__WeightMatrix__ = self.__WeightMatrix__.tocsc()
    #iterate through columns
    for Col in xrange(self.__WeightMatrix__.shape[1]):
       Column = self.__WeightMatrix__[:,Col].data
       List = [x**2 for x in Column]
       #get the column length
       Len = math.sqrt(sum(List))
       #here I assumed dot(number,Column) would do a basic scalar product
       dot((1/Len),Column)
       #now what? how do I update the original column of the matrix? everything returned so far is a copy, which drives me nuts and makes me miss pointers

I've searched through the scipy sparse matrix documentation and found nothing useful. I was hoping for a function that returns a pointer/reference to the matrix so that I can modify its values directly. Thanks

Upvotes: 3

Views: 3641

Answers (1)

Jaime

Reputation: 67417

In CSC format you have two writable attributes, data and indices, which hold the non-zero entries of your matrix and the corresponding row indices. You can use these to your advantage as follows:

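import numpy as np

def sparse_row_normalize(sps_mat):
    # Scale every row of a CSC matrix to unit Euclidean length, in place.
    if sps_mat.format != 'csc':
        msg = 'Can only row-normalize in place with csc format, not {0}.'
        msg = msg.format(sps_mat.format)
        raise ValueError(msg)
    # Sum the squared stored values per row index, then take the square root.
    row_norm = np.sqrt(np.bincount(sps_mat.indices, weights=sps_mat.data * sps_mat.data))
    # Divide each stored value by the norm of the row it belongs to.
    sps_mat.data /= np.take(row_norm, sps_mat.indices)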

To see that it actually works:

>>> mat = scipy.sparse.rand(4, 4, density=0.5, format='csc')
>>> mat.toarray()
array([[ 0.        ,  0.        ,  0.58931687,  0.31070526],
       [ 0.24024639,  0.02767106,  0.22635696,  0.85971295],
       [ 0.        ,  0.        ,  0.13613897,  0.        ],
       [ 0.        ,  0.13766507,  0.        ,  0.        ]])
>>> mat.toarray() / np.sqrt(np.sum(mat.toarray()**2, axis=1))[:, None]
array([[ 0.        ,  0.        ,  0.88458487,  0.46637926],
       [ 0.26076366,  0.03003419,  0.24568806,  0.93313324],
       [ 0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])
>>> sparse_row_normalize(mat)
>>> mat.toarray()
array([[ 0.        ,  0.        ,  0.88458487,  0.46637926],
       [ 0.26076366,  0.03003419,  0.24568806,  0.93313324],
       [ 0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])

And it is also numpy fast, no Python loops spoiling the fun:

In [2]: mat = scipy.sparse.rand(10000, 10000, density=0.005, format='csc')

In [3]: mat
Out[3]: 
<10000x10000 sparse matrix of type '<type 'numpy.float64'>'
    with 500000 stored elements in Compressed Sparse Column format>

In [4]: %timeit sparse_row_normalize(mat)
100 loops, best of 3: 14.1 ms per loop
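
The function above normalizes rows, while the question asks about columns. The same in-place idea should carry over to columns by also using the indptr attribute, which in CSC format marks where each column's entries start and end inside data. What follows is only a minimal sketch: the name sparse_col_normalize is made up for illustration, and it assumes the matrix holds floating-point data and has no columns made up solely of explicitly stored zeros (those would produce a division by zero).

```python
import numpy as np
import scipy.sparse as sp

def sparse_col_normalize(sps_mat):
    """Scale each column of a CSC matrix to unit Euclidean length, in place.

    Hypothetical helper; assumes float data in the matrix.
    """
    if sps_mat.format != 'csc':
        raise ValueError('Can only column-normalize in place with csc format, '
                         'not {0}.'.format(sps_mat.format))
    n_cols = sps_mat.shape[1]
    # Number of stored entries in each column.
    counts = np.diff(sps_mat.indptr)
    # Column index of every stored entry, in the same order as sps_mat.data.
    col_of_entry = np.repeat(np.arange(n_cols), counts)
    # Sum of squares per column, then the square root gives the column length.
    col_norm = np.sqrt(np.bincount(col_of_entry,
                                   weights=sps_mat.data ** 2,
                                   minlength=n_cols))
    # Empty columns have norm 0 but are never indexed here,
    # because they contribute no entries to col_of_entry.
    sps_mat.data /= col_norm[col_of_entry]

# Small usage example: the second column is empty and stays empty.
example = sp.csc_matrix(np.array([[3.0, 0.0, 1.0],
                                  [4.0, 0.0, 0.0],
                                  [0.0, 0.0, 1.0]]))
sparse_col_normalize(example)
```

Because only the data array is modified, no new matrix is allocated, which is the behavior the question was looking for.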

Upvotes: 6
