Reputation: 26196
I am running a rolling OLS regression estimation (for example, with a window of 100) over the dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the following format.
time X Y
0.000543 0 10
0.000575 0 10
0.041324 1 10
0.041331 2 10
0.041336 3 10
0.04134 4 10
...
9.987735 55 239
9.987739 56 239
9.987744 57 239
9.987749 58 239
9.987938 59 239
The third column (Y) in my dataset is my true value - that's what I want to predict (estimate). I want to predict the current value of Y from the previous rolling window of values of X and time (window = 100 here). For this, I have the following working Python script using statsmodels.
#!/usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # drop NaNs, in case there are any
window = 100
#print(df.index)  # to print the index
df['a'] = None   # constant (intercept)
df['b1'] = None  # beta1 (time)
df['b2'] = None  # beta2 (X)
for i in range(window, len(df)):
    # fit OLS on the previous `window` rows
    temp = df.iloc[i - window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'],
                     sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]

# predicted values, one per row, using the previous row's coefficients
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1) * df['time'] + df['b2'].shift(1) * df['X']
#print(df['predicted'])
print(df)
This gives me sample output in the following format.
time X Y a b1 b2 predicted
0 0.000543 0 10 None None None NaN
1 0.000575 0 10 None None None NaN
2 0.041324 1 10 None None None NaN
3 0.041331 2 10 None None None NaN
4 0.041336 3 10 None None None NaN
.. ... .. .. ... ... ... ...
50 0.041340 4 10 10 0 1.55431e-15 NaN
51 0.041345 5 10 10 1.7053e-13 7.77156e-16 10
52 0.041350 6 10 10 1.74623e-09 -7.99361e-15 10
53 0.041354 7 10 10 6.98492e-10 -6.21725e-15 10
.. ... .. .. ... ... ... ...
509 0.160835 38 20 20 4.88944e-09 -1.15463e-14 20
510 0.160839 39 20 20 1.86265e-09 5.32907e-15 20
.. ... .. .. ... ... ... ...
Finally, I want to include the mean squared error (MSE) over all the predicted values, as a summary of the OLS regression analysis. For example, if we look at row 5, the value of X is 2 and the value of Y is 10; if the predicted value of Y at that row were 6, the squared error for that row would be (10-6)^2.
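For reference, a minimal sketch of that calculation, assuming the df and df['predicted'] produced by the script above (the sq_error column name is my own):
df['sq_error'] = (df['Y'] - df['predicted'].astype(float)) ** 2  # squared error per row
mse = df['sq_error'].mean()  # mean over the rows that have a prediction (NaN rows are skipped)
print(mse)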
sm.OLS returns an instance of <class 'statsmodels.regression.linear_model.OLS'>, and doing print(RollOLS.summary()) on the fitted result gives:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: -inf
Model: OLS Adj. R-squared: -inf
Method: Least Squares F-statistic: -48.50
Date: Tue, 04 Jul 2017 Prob (F-statistic): 1.00
Time: 22:19:18 Log-Likelihood: 2359.7
No. Observations: 100 AIC: -4713.
Df Residuals: 97 BIC: -4706.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 239.0000 2.58e-09 9.26e+10 0.000 239.000 239.000
time 4.547e-13 2.58e-10 0.002 0.999 -5.12e-10 5.13e-10
X -3.886e-16 1.1e-13 -0.004 0.997 -2.19e-13 2.19e-13
==============================================================================
Omnibus: 44.322 Durbin-Watson: 0.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.471
Skew: -1.886 Prob(JB): 1.67e-19
Kurtosis: 5.556 Cond. No. 9.72e+04
==============================================================================
But the value of rsquared (print(RollOLS.rsquared)), for example, should be between 0 and 1 instead of -inf, and this seems to be an issue with missing intercepts. If we want to print the MSE, we do print(RollOLS.mse_model), etc., as per the documentation. How can we add the intercepts and print the regression statistics with correct values, as we do for the predicted values? What am I doing wrong here? Or is there another way of doing this using scikit-learn?
Upvotes: 0
Views: 4710
Reputation: 7506
Short Answer
The value of r^2 is going to be +/- inf as long as y remains constant over the regression window (100 observations in your case). You can find more details below, but the intuition is that r^2 is the proportion of y's variance explained by X: if y's variance is zero, r^2 is simply not well defined.
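You can reproduce this with a tiny sketch (made-up data, using only numpy and statsmodels, with y held constant as in your windows):
import numpy as np
import statsmodels.api as sm

x = np.arange(100, dtype=float)
y = np.full(100, 10.0)  # y constant over the whole window, like your Y
res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.rsquared)  # degenerate (-inf or nan): the total sum of squares is 0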
Possible solution: Try to use a longer window, or resample Y and X so that Y does not remain constant for so many consecutive observations.
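As a rough sketch of the resampling idea, reusing the df and pandas import from your script and assuming the time column is in seconds (the 100ms bin width is an arbitrary choice):
# aggregate X and Y onto a coarser, regular time grid so Y varies within a window
idx = pd.to_datetime(df['time'], unit='s')
df_rs = df[['X', 'Y']].set_index(idx).resample('100ms').mean().dropna()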
Long Answer
Looking at this, I honestly don't think this is the right dataset for the regression. This is a simple plot of the data:
Does a linear combination of X and time explain Y? It doesn't look plausible. Y almost looks like a discrete variable, so you probably want to look at logistic regression.
To come to your question: R^2 is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)". From Wikipedia, R^2 = 1 - SS_res / SS_tot, where SS_tot = sum((y_i - mean(y))^2) is the total sum of squares.
In your case it is very likely that Y is constant over the 100 observations, hence it has zero variance; that produces a division by zero, hence the -inf.
So I am afraid you should not look to fixes in the code, but you should rethink the problem and the way of fitting the data.
Upvotes: 1
Reputation: 1596
OK, so I prepared this small example so you can visualize what a Poisson regression can do.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

poi_model = sm.Poisson  # statsmodels.discrete.discrete_model.Poisson
x = np.random.uniform(0, 20, 1000)    # random regressor values
s = np.random.poisson(x * 0.5, 1000)  # Poisson counts whose rate grows with x
plt.bar(x, s)
plt.show()
This generates random Poisson counts.
Now, the way to fit a Poisson regression to the data is the following:
my_model = poi_model(endog=s, exog=x)  # counts s regressed on x
my_model = my_model.fit()
print(my_model.summary())
The summary displays a number of statistics, but if you want to compute the mean squared error, you can do it like so:
preds = my_model.predict()           # in-sample predictions
mse = np.mean(np.square(preds - s))  # mean squared error computed by hand
If you want to predict new values, do the following:
my_model.predict(exog=new_value)
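For example (new_value here is hypothetical; any array-like of new x values works):
new_value = np.array([5.0, 10.0])  # hypothetical new x values
print(my_model.predict(exog=new_value))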
Upvotes: 0