Reputation: 733
I am trying to fit a multinomial logistic regression and then predict the outcome for a set of samples.
### RZS_TC is my dataframe
RZS_TC.loc[RZS_TC['Mean_Treecover'] <= 50, 'Mean_Treecover' ] = 0
RZS_TC.loc[RZS_TC['Mean_Treecover'] > 50, 'Mean_Treecover' ] = 1
RZS_TC[['MAP']+['Sr']+['delTC']+['Mean_Treecover']].head()
[Output]:
MAP Sr delTC Mean_Treecover
302993741 2159.297363 452.975647 2.666672 1.0
217364332 3242.351807 65.615341 8.000000 1.0
390863334 1617.215454 493.124054 5.666666 0.0
446559668 1095.183105 498.373383 -8.000000 0.0
246078364 2804.615234 98.981110 -4.000000 1.0
1000000 rows × 7 columns
#Fitting a logistic regression
from statsmodels.formula.api import mnlogit
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
print(model.summary2())
[Output]:
Results: MNLogit
====================================================================
Model: MNLogit Pseudo R-squared: 0.364
Dependent Variable: Mean_Treecover AIC: 831092.4595
Date: 2021-04-02 13:51 BIC: 831139.7215
No. Observations: 1000000 Log-Likelihood: -4.1554e+05
Df Model: 3 LL-Null: -6.5347e+05
Df Residuals: 999996 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
--------------------------------------------------------------------
Mean_Treecover = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept -5.2200 0.0119 -438.4468 0.0000 -5.2434 -5.1967
MAP 0.0023 0.0000 491.0859 0.0000 0.0023 0.0023
Sr 0.0016 0.0000 90.6805 0.0000 0.0015 0.0016
delTC -0.0093 0.0002 -39.9022 0.0000 -0.0098 -0.0089
However, whenever I try to predict using the model.predict()
function, I get the following error.
prediction = model.predict(np.array(RZS_TC[['MAP']+['Sr']+['delTC']]))
[Output]: ERROR! Session/line number was not unique in database. History logging moved to new session 2627
Does anyone know how to troubleshoot this? Is there something that I might be doing wrong?
Upvotes: 2
Views: 567
Reputation: 46908
The formula interface adds an intercept column to the design matrix, so an array passed to predict has to include it as well. Using some example data:
from statsmodels.formula.api import mnlogit
import pandas as pd
import numpy as np
RZS_TC = pd.DataFrame(np.random.uniform(0,1,(20,4)),
columns=['MAP','Sr','delTC','Mean_Treecover'])
RZS_TC['Mean_Treecover'] = round(RZS_TC['Mean_Treecover'])
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
You can inspect the design matrix the model was fitted on; note the leading column of ones for the intercept:
model.model.exog[:5,]
Out[16]:
array([[1. , 0.33914763, 0.79358056, 0.3103758 ],
[1. , 0.45915785, 0.94991271, 0.27203524],
[1. , 0.55527662, 0.15122108, 0.80675951],
[1. , 0.18493681, 0.89854583, 0.66760684],
[1. , 0.38300074, 0.6945397 , 0.28128137]])
Which is the same as if you add a constant:
import statsmodels.api as sm
sm.add_constant(RZS_TC[['MAP','Sr','delTC']])
const MAP Sr delTC
0 1.0 0.339148 0.793581 0.310376
1 1.0 0.459158 0.949913 0.272035
2 1.0 0.555277 0.151221 0.806760
3 1.0 0.184937 0.898546 0.667607
If you pass a DataFrame with the same column names, the formula adds the intercept for you, so it is simply:
prediction = model.predict(RZS_TC[['MAP','Sr','delTC']])
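Note that predict returns one probability per class and row, not a class label. A minimal sketch of turning those probabilities into labels, assuming synthetic data in place of your RZS_TC (the names `new_data`, `probs` and `labels` are illustrative only):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import mnlogit

# Synthetic stand-in for RZS_TC (assumption: same column names as the question)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(rng.uniform(0, 1, (n, 3)), columns=["MAP", "Sr", "delTC"])
# Noisy binary outcome so the fit does not run into perfect separation
df["Mean_Treecover"] = (df["MAP"] + rng.normal(0, 0.5, n) > 0.5).astype(float)

model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", df).fit(disp=0)

# Rows to predict for; only the predictor columns are needed
new_data = df[["MAP", "Sr", "delTC"]].head()

# One row per sample, one column per outcome class (here: class 0 and class 1)
probs = np.asarray(model.predict(new_data))

# Most likely class per row: index of the largest probability
labels = probs.argmax(axis=1)
```

Here the column index matches the class because the outcome is coded 0/1; with other codings you would need to map the column positions back to the original categories.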
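Or, if you only need the in-sample fit, use the stored results. As far as I know, statsmodels documents fittedvalues for discrete models as the linear predictor, so call model.predict() with no arguments if you want the in-sample probabilities instead: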
model.fittedvalues
Upvotes: 1