How to ignore zeros when calculating correlations between columns of a sparse matrix in SciPy

Question

I have a sparse CSR matrix with 1 million rows and 1500 columns. I want to calculate the correlation between the columns.

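import numpy as np

def corrcoef_csr(x, axis=0):
    '''correlation matrix, return type ndarray'''
    covx = cov_csr(x, axis=axis)
    # standard deviations as a row vector, taken from the covariance diagonal
    stdx = np.sqrt(np.diag(covx))[np.newaxis, :]
    return covx / (stdx.T * stdx)

def cov_csr(x, axis=0):
    '''return covariance matrix, assumes column variable
    return type ndarray'''
    # x.sum() on a sparse matrix returns np.matrix, so `*` below is a matrix product
    meanx = x.sum(axis=axis) / float(x.shape[axis])
    if axis == 0:
        # E[x_i * x_j] - E[x_i] * E[x_j], with columns as variables
        return np.array((x.T * x) / x.shape[axis] - meanx.T * meanx)
    else:
        # same formula with rows as variables
        return np.array((x * x.T) / x.shape[axis] - meanx * meanx.T)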

I currently calculate the correlation with corrcoef_csr(ip_matrix). However, when computing the correlation between two columns, I want to ignore the rows where both columns are zero.

Any idea how I could do that?
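
To make the requirement concrete, here is a minimal sketch of one possible reading of it, not a tested solution for the full 1 million x 1500 matrix. It relies on one observation: a row where both columns are zero adds nothing to the sums, sums of squares, or cross products of that pair, so dropping those rows only changes the row count n used for each pair (i, j). The function name corrcoef_ignore_joint_zeros is hypothetical and is not part of SciPy. Columns are assumed to be the variables, and explicitly stored zeros are assumed to count as zeros.

```python
import numpy as np
from scipy import sparse


def corrcoef_ignore_joint_zeros(x):
    """Column-by-column correlation, skipping rows where both columns are 0.

    Hypothetical helper, not part of SciPy. Returns a dense ndarray of shape
    (n_cols, n_cols); pairs with undefined correlation come out as NaN.
    """
    x = sparse.csr_matrix(x, dtype=np.float64, copy=True)
    x.eliminate_zeros()          # explicitly stored zeros must not count as data

    b = x.copy()
    b.data[:] = 1.0              # indicator matrix: 1 where x is nonzero

    s = np.asarray(x.sum(axis=0)).ravel()               # column sums
    q = np.asarray(x.multiply(x).sum(axis=0)).ravel()   # column sums of squares
    nnz = np.asarray(b.sum(axis=0)).ravel()             # nonzeros per column

    p = (x.T @ x).toarray()      # p[i, j] = sum over rows of x[:, i] * x[:, j]
    both = (b.T @ b).toarray()   # rows where columns i and j are both nonzero

    # n[i, j] = number of rows where at least one of the two columns is nonzero
    n = nnz[:, None] + nnz[None, :] - both

    # Division by zero (two all-zero columns) or zero variance yields NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        mean_i = s[:, None] / n  # mean of column i over the rows kept for (i, j)
        mean_j = s[None, :] / n  # mean of column j over the same rows
        cov = p / n - mean_i * mean_j
        var_i = q[:, None] / n - mean_i ** 2
        var_j = q[None, :] / n - mean_j ** 2
        return cov / np.sqrt(var_i * var_j)
```

It would be called as corrcoef_ignore_joint_zeros(ip_matrix). Only the column-by-column intermediates are dense (1500 x 1500 here), so the 1 million rows stay sparse throughout. Note that the means and variances now differ per pair, so the result is not guaranteed to be a valid (positive semidefinite) correlation matrix.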
