Reputation: 3308
I am trying to get in-sample predictions from an OLS fit, as below:
import numpy as np
import pandas as pd
import statsmodels.api as sm
macrodata = sm.datasets.macrodata.load_pandas().data
macrodata.index = pd.period_range('1959Q1', '2009Q3', freq='Q')
mod = sm.OLS(macrodata['realgdp'], sm.add_constant(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']])).fit()
mod.get_prediction(sm.add_constant(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']])).summary_frame(0.95).head()
This is fine. But if I change the order of the regressors passed to mod.get_prediction, I get different estimates:
mod.get_prediction(sm.add_constant(macrodata[['tbilrate', 'unemp', 'realdpi', 'realinv']])).summary_frame(0.95).head()
This is surprising. Can't mod.get_prediction identify the regressors based on column names?
Upvotes: 1
Views: 704
Reputation: 46898
As noted in the comments, sm.OLS converts your data frame into an array for fitting. Prediction works the same way, so it expects the predictors to be in the same column order that was used for the fit; the column names are not consulted.
If you would like the column names to be used, you can use the formula interface; see the documentation for more details. Below I apply it to your example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
macrodata = sm.datasets.macrodata.load_pandas().data
mod = smf.ols(formula='realgdp ~ realdpi + realinv + tbilrate + unemp', data=macrodata)
res = mod.fit()
With the columns in the order given in the formula:
res.get_prediction(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']]).summary_frame(0.95).head()
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 2716.423418 14.608110 2715.506229 2717.340607 2710.782460 2722.064376
1 2802.820840 13.714821 2801.959737 2803.681943 2797.188729 2808.452951
2 2781.041564 12.615903 2780.249458 2781.833670 2775.419588 2786.663539
3 2786.894138 12.387428 2786.116377 2787.671899 2781.274166 2792.514110
4 2848.982580 13.394688 2848.141577 2849.823583 2843.353507 2854.611653
Results are the same if we flip the columns:
res.get_prediction(macrodata[['tbilrate', 'unemp', 'realdpi', 'realinv']]).summary_frame(0.95).head()
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 2716.423418 14.608110 2715.506229 2717.340607 2710.782460 2722.064376
1 2802.820840 13.714821 2801.959737 2803.681943 2797.188729 2808.452951
2 2781.041564 12.615903 2780.249458 2781.833670 2775.419588 2786.663539
3 2786.894138 12.387428 2786.116377 2787.671899 2781.274166 2792.514110
4 2848.982580 13.394688 2848.141577 2849.823583 2843.353507 2854.611653
Upvotes: 1