Is there a memory-saving way to calculate the Pearson correlation coefficient of two sparse matrices?

Question

vec1 and vec2 are both 1x200000 sparse matrices. I want to calculate the Pearson correlation coefficient between them (equivalent to scipy.stats.pearsonr). Is there a memory-efficient way to do this?

skalet · Accepted Answer

200k elements is usually not considered a lot (about 1.6 MB per vector as dense float64), so you can probably just convert them to dense arrays and use scipy.stats.pearsonr.

For reference though, see my implementation for sparse vectors below.

Note that the trick is to keep the scalar terms outside the sum so that all operations stay sparse: subtracting the mean element-wise would turn every implicit zero into a nonzero and densify the vector. The constant term has to be counted once per element (n*mx**2, not mx**2); with that, the result agrees with scipy.stats.pearsonr up to floating-point rounding and the final assertion passes.

import numpy as np
from scipy import sparse
from scipy.stats import pearsonr

def create_sparse_vector(n):
    # n x 1 random sparse column vector (default density 0.01)
    return sparse.random(n, 1)

def dense_pearsonr(x, y):
    # Reference result: densify both vectors and use SciPy's implementation
    r, p = pearsonr(x.toarray().squeeze(), y.toarray().squeeze())
    return r

def sparse_pearsonr(x, y):
    n = x.shape[0]
    assert n == y.shape[0]
    mx = x.mean()
    my = y.mean()
    # sum((x - mx)**2) = sum(x*x - 2*mx*x) + n*mx**2
    sx = np.sqrt((x.multiply(x) - 2*mx*x).sum() + n*mx**2)
    sy = np.sqrt((y.multiply(y) - 2*my*y).sum() + n*my**2)
    # sum((x - mx)*(y - my)) = sum(x*y) - n*mx*my
    a = x.multiply(y).sum() - n*mx*my
    b = sx*sy
    c = a / b
    # Clamp to [-1, 1] to absorb rounding error
    return min(max(c, -1.0), 1.0)

N = 200000
x = create_sparse_vector(N)
y = create_sparse_vector(N)
r1 = dense_pearsonr(x, y)
r2 = sparse_pearsonr(x, y)
print(r1)
print(r2)
assert np.isclose(r1, r2)
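
Since the question uses 1x200000 row vectors while the code above uses column vectors, here is a minimal sketch of the same idea that accepts either orientation. It expresses the centered sums through raw sums over the stored entries only. The name sparse_pearsonr_any_shape is a hypothetical helper, not part of SciPy, and the sketch assumes both inputs are SciPy sparse matrices of identical shape and that neither vector is constant (otherwise the denominator is zero).

```python
import numpy as np
from scipy import sparse


def sparse_pearsonr_any_shape(x, y):
    """Pearson r of two sparse vectors (1 x n or n x 1) without densifying.

    Hypothetical helper for illustration; not a SciPy function.
    """
    x = sparse.csr_matrix(x)
    y = sparse.csr_matrix(y)
    assert x.shape == y.shape and 1 in x.shape
    n = x.shape[0] * x.shape[1]  # total element count, implicit zeros included

    sum_x, sum_y = x.sum(), y.sum()
    # Centered sums written via raw sums, so implicit zeros cost nothing:
    #   sum((x - mx) * (y - my)) = sum(x*y) - sum(x)*sum(y)/n
    #   sum((x - mx)**2)         = sum(x*x) - sum(x)**2/n
    cov = x.multiply(y).sum() - sum_x * sum_y / n
    var_x = x.multiply(x).sum() - sum_x**2 / n
    var_y = y.multiply(y).sum() - sum_y**2 / n

    r = cov / np.sqrt(var_x * var_y)
    # Clamp to [-1, 1] to absorb floating-point rounding
    return float(np.clip(r, -1.0, 1.0))


# Usage with row vectors shaped like vec1 and vec2 in the question
vec1 = sparse.random(1, 200000, density=0.01, format="csr", random_state=0)
vec2 = sparse.random(1, 200000, density=0.01, format="csr", random_state=1)
print(sparse_pearsonr_any_shape(vec1, vec2))
```

Only element-wise products and sums of sparse matrices are used, so memory stays proportional to the number of stored nonzeros rather than to n.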
