H.Ji
H.Ji

Reputation: 195

Is there any memory-saving way to calculate Pearson correlation coefficient of two sparse matrix?

vec1 and vec2 are both 1x200000 sparse matrix. I want to calculate the Pearson correlation coefficient between them (equivalent to scipy.stats.pearsonr). Is there any way?

Upvotes: 1

Views: 837

Answers (2)

hpaulj
hpaulj

Reputation: 231540

Define 2 sparse matrices:

In [15]: M1 = sparse.random(1,1000)
In [16]: M2 = sparse.random(1,1000)

In [17]: stats.pearsonr(M1.A[0], M2.A[0])
Out[17]: (0.20251345569257148, 1.0257264375653503e-10)

Same value:

In [23]: x=M1.A[0]; y=M2.A[0]
In [24]: x1=x-x.mean(); y1=y-y.mean()
In [25]: (x1*y1).sum()/np.sqrt((x1*x1).sum()*(y1*y1).sum())
Out[25]: 0.20251345569257148

But I can't do that with sparse matrices:

In [27]: M1.mean()
Out[27]: 0.0050088190479221925
In [28]: M1 - M1.mean()
...
NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported

Subtracting a nonzero value from a sparse matrix means that it is no longer sparse.


Ignoring the means:

In [36]: (M1.multiply(M2)).sum()/np.sqrt((M1.multiply(M1)).sum()*(M2.multiply(M2)).sum())
Out[36]: 0.20828802559020454

Upvotes: 1

skalet
skalet

Reputation: 384

200k elements is usually not considered a lot, you probably can just convert them to dense matrices and use scipy.stats.pearsonr.

For reference though, see my implementation for sparse vectors below.

Note that the numeric error is quite big, so the last assertion mostly fails. Also note the trick is to put the scalar subtraction outside the sum in order to keep all operations sparse.

import numpy as np
from scipy import sparse
from scipy.stats import pearsonr

def create_sparse_vector(n):
    return sparse.random(n,1)

def dense_pearsonr(x, y): 
    r, p = pearsonr(x.A.squeeze(), y.A.squeeze())
    return r

def sparse_pearsonr(x, y): 
    n = x.shape[0]
    assert(n == y.shape[0])
    mx = x.mean()
    my = y.mean()
    sx = np.sqrt(np.sum(x.multiply(x) - 2*mx*x) + mx**2)
    sy = np.sqrt(np.sum(y.multiply(y) - 2*my*y) + my**2)
    a = np.sum(x.multiply(y)) - n*mx*my
    b = sx*sy
    c = a / b 
    return min(max(c,-1.0),1.0)

N = 200000 
x = sparse.random(N,1)
y = sparse.random(N,1)
r1 = dense_pearsonr(x,y)
r2 = sparse_pearsonr(x,y)
print(r1)
print(r2)
assert(np.isclose(r1,r2)) # Warning: Assertion fails because of too big numerical error

Upvotes: 1

Related Questions