Reputation: 7740
Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/climate_change.csv
I'm building a multiple linear regression model with pandas and statsmodels:
import pandas as pd
import statsmodels.api as sm
climate = pd.read_csv("climate_change.csv")
climate_train = climate.query('Year <= 2006')
climate_test = climate.query('Year > 2006')
y = climate_train['Temp']
x = climate_train[['MEI', 'N2O', 'TSI', 'Aerosols']]
x = sm.add_constant(x)
model2 = sm.OLS(y, x).fit()
model2.summary()
And I want to test it on my test dataset:
model2.predict(climate_test)
But I get the following error:
ValueError: shapes (24,11) and (5,) not aligned: 11 (dim 1) != 5 (dim 0)
From a related question, I suspect this happens because I'm not adding a constant to my test dataset, but
model2.predict(sm.add_constant(climate_test))
doesn't work either. If I list the independent variables explicitly, it works:
model2.predict(sm.add_constant(climate_test[['MEI', 'N2O', 'TSI', 'Aerosols']]))
But since model2 already "knows" these variables, I can't see a reason why I should repeat them in the method call.
How can I call predict() without listing the independent variables explicitly?
Upvotes: 2
Views: 1627
Reputation: 13768
I don't think there's a way to do it fully automatically.
If you're trying to save typing, store the "x-columns" in a variable:
xvars = ['MEI', 'N2O', 'TSI', 'Aerosols']
and use it both when fitting the model and when calling predict().
Upvotes: 2