Reputation: 492
I'm using np.linalg.lstsq to build a regression line inside a function like this:
def lsreg(x, y):
    if not isinstance(x, np.ndarray):
        x = np.array(x)
    if not isinstance(y, np.ndarray):
        y = np.array(y)
    A = np.array([x, np.ones(len(x))])
    ret = np.linalg.lstsq(A.T, y)
    return ret[0]
and calling it like this:
x = np.array([10000001, 10000002, 10000003])
y = np.array([3.0, 4.0, 5.0])
regress = lsreg(x, y)
fit = regress[0]*x + regress[1]
print fit
and the output I get is:
[ 3. 4. 5.]
So far, so good. Now, if I change x like this:
x = np.array([100000001, 100000002, 100000003])
y = np.array([3.0, 4.0, 5.0])
regress = lsreg(x, y)
fit = regress[0]*x + regress[1]
print fit
I get
[ 3.99999997 4.00000001 4.00000005]
instead of something close to 3, 4 and 5.
Any clue on what is going on?
Upvotes: 4
Views: 2269
Reputation: 492
I tried with scipy:
from scipy import stats
x = np.array([100000001, 100000002, 100000003])
y = np.array([3.0, 4.0, 5.0])
res = stats.linregress(x, y)
print x*res[0] + res[1]
and I get:
[ 3. 4. 5.]
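linregress likely succeeds here because it works from the sample means and covariances, which effectively centers the data before any solving happens, sidestepping the ill-conditioned design matrix. A small sketch of the same call using the result's named attributes (modern SciPy exposes slope and intercept on the returned object):

```python
import numpy as np
from scipy import stats

x = np.array([100000001, 100000002, 100000003])
y = np.array([3.0, 4.0, 5.0])

# linregress is computed from means and (co)variances, so the huge
# constant offset in x is removed before the slope is estimated.
res = stats.linregress(x, y)
fit = res.slope * x + res.intercept
print(fit)  # close to [3. 4. 5.]
```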
Upvotes: 0
Reputation: 10219
Your problem is due to numerical errors that occur when solving an ill-conditioned system of equations.
In [115]: np.linalg.lstsq(A.T, y)
Out[115]:
(array([ 3.99999993e-08, 3.99999985e-16]),
array([], dtype=float64),
1,
array([ 1.73205084e+08, 1.41421352e-08]))
Notice that np.linalg.lstsq returned 1 for the rank of the matrix A.T built from your input. That means it considers the matrix numerically rank 1, i.e. ill-conditioned: your least-squares problem has two unknowns, so the matrix should have rank 2. The second singular value, about 1.4e-08 versus 1.7e+08 for the first, confirms it: at that ratio the two columns (the x values and the ones) are, to machine precision, indistinguishable from parallel. This is the reason for the "wrong" result. You should google along the lines of "numerical linear algebra numerical errors" to learn more about this problem.
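One standard remedy, sketched under the assumption that you only need the slope and intercept: subtract the mean of x before building the design matrix, so its two columns are no longer nearly parallel and the system becomes well-conditioned.

```python
import numpy as np

x = np.array([100000001, 100000002, 100000003], dtype=float)
y = np.array([3.0, 4.0, 5.0])

# Shift x by its mean: the design matrix columns [x - x0, 1] are now
# well separated in scale, so lstsq sees a full-rank 3 x 2 matrix.
x0 = x.mean()
A = np.vstack([x - x0, np.ones(len(x))]).T

slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
fit = slope * (x - x0) + intercept
print(fit)  # close to [3. 4. 5.]
```

Note that the intercept is for the shifted variable x - x0; if you need it on the original scale, it is intercept - slope * x0.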
Upvotes: 2