Desta Haileselassie Hagos

Reputation: 26196

Missing intercepts of OLS Regression models in Python statsmodels

I am running a rolling OLS regression (with a window of, for example, 100 observations) on the dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the following format.

 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

The third column (Y) in my dataset is my true value - that's what I want to predict (estimate). I want to predict the current value of Y from the previous rolling window of X values. For this, I have the following Python script using statsmodels.

#!/usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm


df = pd.read_csv('estimated_pred.csv')
df = df.dropna()   # drop NaNs in case there are any
window = 100
df['a'] = np.nan   # constant (intercept)
df['b1'] = np.nan  # beta1 (coefficient on time)
df['b2'] = np.nan  # beta2 (coefficient on X)
for i in range(window, len(df)):
    temp = df.iloc[i-window:i, :]
    # OLS of Y on [const, time, X] over the trailing window
    RollOLS = sm.OLS(temp.loc[:, 'Y'],
                     sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params['const']
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params['time']
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params['X']

# one-step-ahead prediction: apply the previous row's coefficients to the current row
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']

print(df)

This gives me sample output of the following format.

         time   X   Y        a           b1           b2  predicted
0    0.000543   0  10      NaN          NaN          NaN        NaN
1    0.000575   0  10      NaN          NaN          NaN        NaN
2    0.041324   1  10      NaN          NaN          NaN        NaN
3    0.041331   2  10      NaN          NaN          NaN        NaN
4    0.041336   3  10      NaN          NaN          NaN        NaN
..        ...  ..  ..      ...          ...          ...        ...
50   0.041340   4  10       10            0  1.55431e-15        NaN
51   0.041345   5  10       10   1.7053e-13  7.77156e-16         10
52   0.041350   6  10       10  1.74623e-09 -7.99361e-15         10
53   0.041354   7  10       10  6.98492e-10 -6.21725e-15         10
..        ...  ..  ..      ...          ...          ...        ...
509  0.160835  38  20       20  4.88944e-09 -1.15463e-14         20
510  0.160839  39  20       20  1.86265e-09  5.32907e-15         20
..        ...  ..  ..      ...          ...          ...        ...

Finally, I want to include the mean squared error (MSE) for all the predicted values, as a summary of the OLS regression analysis. For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Let's say the predicted value of Y at that row is 6; the squared error would then be (10-6)^2. sm.OLS returns an instance of <class 'statsmodels.regression.linear_model.OLS'>, and print(RollOLS.summary()) gives:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -48.50
Date:                Tue, 04 Jul 2017   Prob (F-statistic):               1.00
Time:                        22:19:18   Log-Likelihood:                 2359.7
No. Observations:                 100   AIC:                            -4713.
Df Residuals:                      97   BIC:                            -4706.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        239.0000   2.58e-09   9.26e+10      0.000       239.000   239.000
time        4.547e-13   2.58e-10      0.002      0.999     -5.12e-10  5.13e-10
X          -3.886e-16    1.1e-13     -0.004      0.997     -2.19e-13  2.19e-13
==============================================================================
Omnibus:                       44.322   Durbin-Watson:                   0.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               86.471
Skew:                          -1.886   Prob(JB):                     1.67e-19
Kurtosis:                       5.556   Cond. No.                     9.72e+04
==============================================================================

But the value of rsquared (print(RollOLS.rsquared)), for example, should be between 0 and 1 instead of -inf, and this seems to be an issue with missing intercepts. If we want to print the MSE, we can do print(RollOLS.mse_model), etc., as per the documentation. How can we add the intercepts and print the regression statistics with the correct values, as we do for the predicted values? What am I doing wrong here? Or is there another way of doing this using the scikit-learn library?
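For reference, the per-row squared error and overall MSE I describe above can be computed directly from the predicted column, for example (a sketch against the df produced by the script above):

valid = df['predicted'].notna()  # the first `window` rows have no prediction
mse = ((df.loc[valid, 'Y'] - df.loc[valid, 'predicted']) ** 2).mean()
print(mse)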

Upvotes: 0

Views: 4710

Answers (2)

FLab

Reputation: 7506

Short Answer

The value of r^2 is going to be +/- inf whenever y remains constant over the regression window (100 observations in your case). You can find more details below, but the intuition is that r^2 is the proportion of y's variance explained by X: if y's variance is zero, r^2 is simply not well defined.
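Here is a minimal sketch reproducing the issue on synthetic data (a constant y over a 100-observation window, as in the flat stretches of your file; this is an illustration, not your actual data):

import numpy as np
import statsmodels.api as sm

y = np.full(100, 10.0)                            # y never changes over the window
X = sm.add_constant(np.arange(100, dtype=float))
res = sm.OLS(y, X).fit()
print(res.rsquared)                               # not in [0, 1]: SS_tot is 0, so r^2 is undefined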

Possible solution: Try to use a longer window, or resample Y and X so that Y does not remain constant for so many consecutive observations.
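One possible way to do the resampling, as a sketch (df here is the DataFrame from your question): keep only the rows where Y actually changes, so that Y varies within every window.

changed = df['Y'] != df['Y'].shift(1)  # True on rows where Y jumps to a new level
df_small = df[changed]                 # at most one row per Y level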

Long Answer

Looking at this, I honestly think this is not the right dataset for the regression. Here is a simple plot of the data:

[Plot of the data: Y against time, staying flat over long stretches and moving in discrete jumps]

Does a linear combination of X and time explain Y? Mmm... it doesn't look plausible. Y almost looks like a discrete variable, so you probably want to look at logistic regression.

To come to your question, R^2 is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)". From Wikipedia:

R^2 = 1 - SS_res / SS_tot, where SS_res = sum_i (y_i - f_i)^2 is the residual sum of squares and SS_tot = sum_i (y_i - y_bar)^2 is the total sum of squares (y_bar being the mean of the observed data).

In your case it is very likely that Y is constant over the 100 observations, hence it has zero variance, which produces a division by zero, hence the inf.

So I am afraid you should not look for fixes in the code; you should rethink the problem and the way of fitting the data.

Upvotes: 1

Diego Aguado

Reputation: 1596

OK, so I prepared this small example so you can visualize what a Poisson regression can do.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

poi_model = sm.Poisson  # Poisson regression (statsmodels.discrete.discrete_model.Poisson)

x = np.random.uniform(0, 20, 1000)    # explanatory variable
s = np.random.poisson(x * 0.5, 1000)  # counts drawn with rate 0.5 * x
plt.bar(x, s)
plt.show()

This generates random Poisson counts.

Now, the way to fit a Poisson regression to the data is the following:

my_model = poi_model(endog=s, exog=x)  # counts as the dependent variable
my_model = my_model.fit()
print(my_model.summary())

The summary displays a number of statistics, but if you want to compute the mean squared error you could do it like so:

preds = my_model.predict()           # in-sample predicted means
mse = np.mean(np.square(preds - s))  # mean squared error against the observed counts

If you want to predict for new values, do the following:

my_model.predict(exog=new_value)
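For example, with a hypothetical new observation (new_value below is just an illustrative array, not part of the original code):

new_value = np.array([[10.0]])  # hypothetical x; the true mean at x = 10 is 0.5 * 10 = 5
print(my_model.predict(exog=new_value))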

Upvotes: 0
