Bogaso

Reputation: 3308

Get prediction of OLS fit from statsmodels

I am trying to get in-sample predictions from an OLS fit, as below:

import numpy as np
import pandas as pd
import statsmodels.api as sm

macrodata = sm.datasets.macrodata.load_pandas().data
macrodata.index = pd.period_range('1959Q1', '2009Q3', freq='Q')
mod = sm.OLS(macrodata['realgdp'], sm.add_constant(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']])).fit()
mod.get_prediction(sm.add_constant(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']])).summary_frame(0.95).head()

This works fine. But if I change the order of the regressors passed to mod.get_prediction, I get different estimates:

mod.get_prediction(sm.add_constant(macrodata[['tbilrate', 'unemp', 'realdpi', 'realinv']])).summary_frame(0.95).head()

This is surprising. Can't mod.get_prediction identify the regressors based on column names?

Upvotes: 1

Views: 704

Answers (1)

StupidWolf

Reputation: 46898

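As noted in the comments, sm.OLS converts your data frame into an array for fitting. Prediction works the same way: the new data is also converted to an array, so columns are matched by position rather than by name, and the predictors must be in the same order as in the fit.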

If you would like the column names to be used, you can use the formula interface (see the documentation for more details). Below I apply it to your example:

import statsmodels.api as sm
import statsmodels.formula.api as smf

macrodata = sm.datasets.macrodata.load_pandas().data
mod = smf.ols(formula='realgdp ~ realdpi + realinv + tbilrate + unemp', data=macrodata)
res = mod.fit()

With the columns in the same order as in the formula:

res.get_prediction(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']]).summary_frame(0.95).head()

          mean    mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  2716.423418  14.608110    2715.506229    2717.340607   2710.782460   2722.064376
1  2802.820840  13.714821    2801.959737    2803.681943   2797.188729   2808.452951
2  2781.041564  12.615903    2780.249458    2781.833670   2775.419588   2786.663539
3  2786.894138  12.387428    2786.116377    2787.671899   2781.274166   2792.514110
4  2848.982580  13.394688    2848.141577    2849.823583   2843.353507   2854.611653

Results are the same if we flip the columns:

res.get_prediction(macrodata[['tbilrate', 'unemp', 'realdpi', 'realinv']]).summary_frame(0.95).head()

          mean    mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  2716.423418  14.608110    2715.506229    2717.340607   2710.782460   2722.064376
1  2802.820840  13.714821    2801.959737    2803.681943   2797.188729   2808.452951
2  2781.041564  12.615903    2780.249458    2781.833670   2775.419588   2786.663539
3  2786.894138  12.387428    2786.116377    2787.671899   2781.274166   2792.514110
4  2848.982580  13.394688    2848.141577    2849.823583   2843.353507   2854.611653

Upvotes: 1
