dayum

Reputation: 1093

Python Memory error on scipy stats. Scipy linalg lstsq <> manual beta

I am not sure whether this question belongs here or on Cross Validated, but since the primary issue is programming-related, I am posting it here.

Inputs:

Y = large 2D numpy array, shape (300000, 30)

X = 1D numpy array, shape (30,)

Desired Output:

B = 1D array, shape (300000,), where each element is the regression coefficient (slope) obtained by regressing the corresponding row of Y (length 30) against X

So B[0] = scipy.stats.linregress(X,Y[0])[0]

I tried this first:

B = scipy.stats.linregress(X,Y)[0] 

hoping that it would broadcast X to the shape of Y. Next, I broadcast X myself to match the shape of Y. In both cases, I got this error:

    File "C:\...\scipy\stats\stats.py", line 3011, in linregress
        ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
    File "C:\...\numpy\lib\function_base.py", line 1766, in cov
        return (dot(X, X.T.conj()) / fact).squeeze()
    MemoryError
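
The MemoryError is consistent with how np.cov treats its arguments: with the default rowvar=True, every row of x and y counts as one variable, so a (30,) x and a (300000, 30) y are stacked into 300001 variables and the result is a 300001 x 300001 matrix. A rough size estimate, assuming float64 entries (the variable names are illustrative only):

```python
import numpy as np

n_rows, n_obs = 300000, 30

# np.cov(x, y) stacks x (1 row) on top of y (n_rows rows) and returns
# a square matrix with one row and one column per stacked row.
n_vars = 1 + n_rows
bytes_needed = n_vars ** 2 * np.dtype(np.float64).itemsize
gib_needed = bytes_needed / 2 ** 30
print(f"{n_vars} x {n_vars} float64 matrix: about {gib_needed:.0f} GiB")

# Small-scale check of the shape behaviour (3 rows instead of 300000).
small = np.cov(np.random.random(n_obs), np.random.random((3, n_obs)), bias=1)
print(small.shape)
```

The result would need several hundred GiB, so the call fails before any slope is computed.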

I then calculated beta manually and, following sascha's suggestion below, also tried scipy.linalg.lstsq, as follows:

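import numpy as np
from scipy.linalg import lstsq

B = lstsq(Y.T, X)[0]  # first estimate of beta: least-squares solution b of Y.T @ b = X
Y1 = Y - Y.mean(1)[:, None]  # center each row of Y
X1 = X - X.mean()  # center X
B1 = np.dot(Y1, X1) / np.dot(X1, X1)  # second estimate of beta: per-row slope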

However, the two estimates of beta are very different:

>>> B1
Out[10]: array([0.135623, 0.028919, -0.106278, ..., -0.467340, -0.549543, -0.498500])
>>> B
Out[11]: array([0.000014, -0.000073, -0.000058, ..., 0.000002, -0.000000, 0.000001])
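
One way to see which estimate corresponds to the desired output is to compare the manual formula against a row-by-row scipy.stats.linregress loop on a small sample. This is only a sketch on random data (not the original inputs); B_ref is an illustrative name:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustrative random data
Y = rng.random((5, 30))         # 5 rows instead of 300000
X = rng.random(30)

# Manual slope: centered cross-product divided by centered sum of squares.
Y1 = Y - Y.mean(1)[:, None]
X1 = X - X.mean()
B1 = np.dot(Y1, X1) / np.dot(X1, X1)

# Reference: slope of each row of Y regressed against X, one call per row.
B_ref = np.array([stats.linregress(X, row)[0] for row in Y])

print(np.allclose(B1, B_ref))
```

If the two agree, the manual estimate is the per-row slope that linregress would return.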

Upvotes: 0

Views: 669

Answers (1)

sascha

Reputation: 33542

SciPy's linregress returns the slope and intercept, which together define the regression line.

If you want to access the coefficients directly, SciPy's lstsq, which is an equivalent formulation, might be more appropriate.

Of course, you need to feed it the correct dimensions (your data is not ready as-is; it needs preprocessing: swap the dimensions).

Code

import numpy as np
from scipy.linalg import lstsq
Y = np.random.random((300000,30))
X = np.random.random(30)
x, res, rank, s = lstsq(Y.T, X)  # Y transposed!
print(x)
print(x.shape)

Output

[  1.73122781e-05   2.70274135e-05   9.80840639e-06 ...,  -1.84597771e-05
   5.25035470e-07   2.41275026e-05]
(300000,)
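
Note that lstsq(Y.T, X) solves Y.T @ b = X, that is, it expresses X as a combination of the rows of Y, which is a different problem from regressing each row of Y against X. A minimal sketch of the per-row formulation, assuming an intercept is wanted as in linregress: build a (30, 2) design matrix from X and a column of ones, and pass Y.T as a multi-column right-hand side. The names A, coef, slopes and intercepts are illustrative, and the data is random:

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
Y = rng.random((1000, 30))  # fewer rows than (300000, 30) to keep the sketch quick
X = rng.random(30)

# Design matrix: one column for the slope, one column of ones for the intercept.
A = np.column_stack([X, np.ones_like(X)])  # shape (30, 2)

# Each column of Y.T is one regression target, so all rows are fitted in one call.
coef, res, rank, s = lstsq(A, Y.T)  # coef has shape (2, 1000)
slopes, intercepts = coef

# Closed-form slope for comparison: centered cross-product over centered sum of squares.
X1 = X - X.mean()
B1 = (Y - Y.mean(1)[:, None]) @ X1 / (X1 @ X1)

print(slopes.shape, np.allclose(slopes, B1))
```

The design matrix stays at (30, 2) regardless of how many rows Y has, so this form should scale to the full array without building any large intermediate matrix.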

Upvotes: 1
