Reputation: 777
I have a sparse CSR matrix with 1 million rows and 1,500 columns. I want to calculate the correlation between the columns.
import numpy as np

def corrcoef_csr(x, axis=0):
    '''correlation matrix, return type ndarray'''
    covx = cov_csr(x, axis=axis)
    stdx = np.sqrt(np.diag(covx))[np.newaxis, :]
    return covx / (stdx.T * stdx)

def cov_csr(x, axis=0):
    '''return covariance matrix, assumes column variable
    return type ndarray'''
    meanx = x.sum(axis=axis) / float(x.shape[axis])
    if axis == 0:
        return np.array((x.T * x) / x.shape[axis] - meanx.T * meanx)
    else:
        return np.array((x * x.T) / x.shape[axis] - meanx * meanx.T)
I am currently calculating the correlation using corrcoef_csr(ip_matrix). However, when calculating the correlation between two columns, I want to ignore the entries (rows) where both columns are zero.
Any idea how I could do that?
Upvotes: 2
Views: 3524
Reputation: 13216
You can use numpy.nonzero to get an array containing only the non-zero elements, e.g. xnz = x[numpy.nonzero(x)]. You then need to make sure that any use of shape etc. corresponds to the size of the reduced array, e.g. xnz.shape[axis].
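As a rough sketch of the nonzero idea on a SciPy sparse matrix: the tiny matrix and the variable names below are made up for illustration. Note that indexing with the nonzero positions flattens the result, so per-column counts have to be tracked separately instead of being read from the original shape.

```python
import numpy as np
from scipy import sparse

# Toy data, purely illustrative (not from the question).
x = sparse.csr_matrix(np.array([[1.0, 0.0],
                                [0.0, 2.0],
                                [3.0, 4.0]]))

# Row and column indices of the nonzero entries.
rows, cols = x.nonzero()

# Values at those positions, flattened to a 1-D array.
values = np.asarray(x[rows, cols]).ravel()

# Number of nonzero entries per column; this replaces x.shape[0]
# as the divisor when averaging over nonzero entries only.
nnz_per_col = np.bincount(cols, minlength=x.shape[1])

# Column means over nonzero entries (assumes no column is entirely zero).
col_sums = np.asarray(x.sum(axis=0)).ravel()
col_means_nz = col_sums / nnz_per_col
```

A column with no nonzero entries would give a division by zero here, so that case needs separate handling.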
You can also use masked arrays, or convert the zeros to nan and use np.nanmean and np.nanstd. There is an interesting discussion of missing data in general at http://docs.scipy.org/doc/numpy-dev/neps/missing-data.html
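For the pairwise case in the question, where a row is dropped only when both columns are zero in it, here is a minimal sketch that stays in sparse form instead of converting to nan. It relies on the fact that the dropped rows contribute nothing to the sums, so only the row count n changes for each pair of columns. The function name corrcoef_ignore_joint_zeros is hypothetical, and the variance uses the numerically naive formula, so treat this as a starting point under those assumptions and not a tested implementation for the full matrix.

```python
import numpy as np
from scipy import sparse

def corrcoef_ignore_joint_zeros(x):
    """Column-by-column correlation, where each pair (i, j) uses only
    the rows in which column i or column j is nonzero.
    Hypothetical helper; returns a dense ndarray."""
    x = sparse.csr_matrix(x, dtype=np.float64)

    # Indicator matrix: 1.0 where x has a nonzero entry.
    b = x.copy()
    b.data = (b.data != 0).astype(np.float64)

    nnz = np.asarray(b.sum(axis=0)).ravel()   # nonzeros per column
    both = (b.T @ b).toarray()                # rows where both columns are nonzero
    # Rows where at least one of the two columns is nonzero.
    n = nnz[:, None] + nnz[None, :] - both

    s = np.asarray(x.sum(axis=0)).ravel()                # column sums
    ss = np.asarray(x.multiply(x).sum(axis=0)).ravel()   # column sums of squares
    sxy = (x.T @ x).toarray()                            # cross products

    # Pairs with n == 0 or zero variance come out as nan or inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sxy / n - np.outer(s, s) / n**2
        var_i = ss[:, None] / n - (s[:, None] / n) ** 2
        var_j = ss[None, :] / n - (s[None, :] / n) ** 2
        return cov / np.sqrt(var_i * var_j)
```

Called as corrcoef_ignore_joint_zeros(ip_matrix), this builds a few dense arrays of size columns by columns, which is small for 1,500 columns. Pairs of columns that are never nonzero, or that are constant over the kept rows, come out as nan.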
Upvotes: 1