alkamid

Reputation: 7740

predict() in pandas statsmodels, adding independent variables

Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/climate_change.csv

I'm building a multiple linear regression model with pandas and statsmodels:

import pandas as pd
import statsmodels.api as sm

climate = pd.read_csv("climate_change.csv")
climate_train = climate.query('Year <= 2006')
climate_test = climate.query('Year > 2006')

y = climate_train['Temp']
x = climate_train[['MEI', 'N2O', 'TSI', 'Aerosols']]
x = sm.add_constant(x)
model2 = sm.OLS(y, x).fit()
model2.summary()

And I want to test it on my test dataset:

model2.predict(climate_test)

But I get the following error:

ValueError: shapes (24,11) and (5,) not aligned: 11 (dim 1) != 5 (dim 0)

From this question I suspect the cause is that I'm not adding a constant to my test dataset, but

model2.predict(sm.add_constant(climate_test))

doesn't work either. If I list the independent variables explicitly, it works:

model2.predict(sm.add_constant(climate_test[['MEI', 'N2O', 'TSI', 'Aerosols']]))

But since model2 already "knows" these variables, I can't see a reason why I should repeat them in the method call.

How can I call predict() without listing the independent variables explicitly?

Upvotes: 2

Views: 1627

Answers (1)

8one6

Reputation: 13768

I don't think there's a way to do it fully automatically.

If the goal is to avoid repeating the list, store the column names in a variable, e.g. xvars = ['MEI', 'N2O', 'TSI', 'Aerosols'], and use it both when fitting the model and when calling predict().

Upvotes: 2
