Reputation: 51
I am using statsmodels OLS to fit a series of points to a line:
import statsmodels.api as sm
Y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15]
X = [[73.759999999999991], [73.844999999999999], [73.560000000000002],
[73.209999999999994], [72.944999999999993], [73.430000000000007],
[72.950000000000003], [73.219999999999999], [72.609999999999999],
[74.840000000000003], [73.079999999999998], [74.125], [74.75],
[74.760000000000005]]
ols = sm.OLS(Y, X)
r = ols.fit()
preds = r.predict()
print preds
And I get the following results:
[ 7.88819844 7.89728869 7.86680961 7.82937917 7.80103898 7.85290687
7.8015737 7.83044861 7.76521269 8.00369809 7.81547643 7.92723304
7.99407312 7.99514256]
These are an about 10 times off. What am I doing wrong? I tried adding a constant, that just makes the values 1000 times bigger. I don't know much about statistics, so maybe there is something I need to do with the data?
Upvotes: 4
Views: 1346
Reputation: 1948
I think you have switched your response and your predictor, like Michael Mayer suggested in his comment. If you plot the data with predictions from your model, you get something like this:
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
Y = np.array([1,2,3,4,5,6,7,8,9,11,12,13,14,15])
X = np.array([ 73.76 , 73.845, 73.56 , 73.21 , 72.945, 73.43 , 72.95 ,
73.22 , 72.61 , 74.84 , 73.08 , 74.125, 74.75 , 74.76 ])
Design = np.column_stack((np.ones(14), X))
ols = sm.OLS(Y, Design).fit()
preds = ols.predict()
plt.plot(X, Y, 'ko')
plt.plot(X, preds, 'k-')
plt.show()
If you switch X and Y, which is what I think you want, you get:
Design2 = np.column_stack((np.ones(14), Y))
ols2 = sm.OLS(X, Design2).fit()
preds2 = ols2.predict()
print preds2
[ 73.1386399 73.21305699 73.28747409 73.36189119 73.43630829
73.51072539 73.58514249 73.65955959 73.73397668 73.88281088
73.95722798 74.03164508 74.10606218 74.18047927]
plt.plot(Y, X, 'ko')
plt.plot(Y, preds2, 'k-')
plt.show()
Upvotes: 5