Reputation: 195
vec1 and vec2 are both 1x200000 sparse matrix. I want to calculate the Pearson correlation coefficient between them (equivalent to scipy.stats.pearsonr). Is there any way?
Upvotes: 1
Views: 837
Reputation: 231540
Define 2 sparse matrices:
In [15]: M1 = sparse.random(1,1000)
In [16]: M2 = sparse.random(1,1000)
In [17]: stats.pearsonr(M1.A[0], M2.A[0])
Out[17]: (0.20251345569257148, 1.0257264375653503e-10)
Same value:
In [23]: x=M1.A[0]; y=M2.A[0]
In [24]: x1=x-x.mean(); y1=y-y.mean()
In [25]: (x1*y1).sum()/np.sqrt((x1*x1).sum()*(y1*y1).sum())
Out[25]: 0.20251345569257148
But I can't do that with sparse matrices:
In [27]: M1.mean()
Out[27]: 0.0050088190479221925
In [28]: M1 - M1.mean()
...
NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported
Subtracting a nonzero value from a sparse matrix means that it is no longer sparse.
Ignoring the means:
In [36]: (M1.multiply(M2)).sum()/np.sqrt((M1.multiply(M1)).sum()*(M2.multiply(M2)).sum())
Out[36]: 0.20828802559020454
Upvotes: 1
Reputation: 384
200k elements is usually not considered a lot, you probably can just convert them to dense matrices and use scipy.stats.pearsonr.
For reference though, see my implementation for sparse vectors below.
Note that the numeric error is quite big, so the last assertion mostly fails. Also note the trick is to put the scalar subtraction outside the sum in order to keep all operations sparse.
import numpy as np
from scipy import sparse
from scipy.stats import pearsonr
def create_sparse_vector(n):
return sparse.random(n,1)
def dense_pearsonr(x, y):
r, p = pearsonr(x.A.squeeze(), y.A.squeeze())
return r
def sparse_pearsonr(x, y):
n = x.shape[0]
assert(n == y.shape[0])
mx = x.mean()
my = y.mean()
sx = np.sqrt(np.sum(x.multiply(x) - 2*mx*x) + mx**2)
sy = np.sqrt(np.sum(y.multiply(y) - 2*my*y) + my**2)
a = np.sum(x.multiply(y)) - n*mx*my
b = sx*sy
c = a / b
return min(max(c,-1.0),1.0)
N = 200000
x = sparse.random(N,1)
y = sparse.random(N,1)
r1 = dense_pearsonr(x,y)
r2 = sparse_pearsonr(x,y)
print(r1)
print(r2)
assert(np.isclose(r1,r2)) # Warning: Assertion fails because of too big numerical error
Upvotes: 1