Reputation: 395
When using Statsmodels and defining the reference level of a categorical variable under the R-style formula framework, the name of the variable in the output is quite long. For instance:
import statsmodels.api as sm
import statsmodels.formula.api as smf
sm.Logit.from_formula("y ~ C(my_variable, Treatment(reference='reference_level'))", data=data)
will output
C(my_variable, Treatment(reference='reference_level'))[T.some_value]
as the variable name in the model summary.
How can I rename this output label to something more readable without changing the variable name?
Thanks.
Upvotes: 3
Views: 1673
Reputation: 351
Under the hood, the formula api uses Patsy to apply your formula string to your data. The long names come from this process.
A quick fix: when you call summary(), you can optionally pass an xname argument. xname is a list of labels applied to each row of the summary's coef table. Its length must equal the length of the params attribute on the results object (the object returned by fit()).
fit = sm.Logit.from_formula("variable ~ C(type,Treatment(reference= 'B'))",data=data).fit()
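# one label per entry in fit.params, in the same order (three here, assuming type has levels A, B and C)
fit.summary(xname=["Intercept", "type A (vs. B)", "type C (vs. B)"])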
Take a look at the Summary object for more customizations.
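If typing the xname list by hand gets tedious, the labels can be derived from the long names instead. The following is only a minimal sketch and not part of statsmodels or Patsy: short_labels is a hypothetical helper, and its regular expression assumes names shaped like the ones in the question. Anything it does not recognise (for example interaction terms) is returned unchanged.

```python
import re

# Matches names such as "C(type, Treatment(reference='B'))[T.A]",
# "C(type)[T.A]" or "type[T.A]". Assumption: one categorical term per name.
_PATSY_LEVEL = re.compile(
    r"^(?:C\((?P<cvar>[^,()]+)(?:,.*)?\)|(?P<var>[^\[\]()]+))"
    r"\[(?:T\.)?(?P<level>[^\]]+)\]$"
)


def short_labels(names):
    """Return a shortened copy of a list of coefficient names."""
    labels = []
    for name in names:
        match = _PATSY_LEVEL.match(name)
        if match is None:
            # Not a recognised categorical name (e.g. "Intercept"): keep as-is.
            labels.append(name)
            continue
        variable = (match.group("cvar") or match.group("var")).strip()
        labels.append(f"{variable}[{match.group('level')}]")
    return labels


long_names = [
    "Intercept",
    "C(type, Treatment(reference='B'))[T.A]",
    "C(type, Treatment(reference='B'))[T.C]",
]
print(short_labels(long_names))
```

Assuming fit is the results object from the snippet above, something like fit.summary(xname=short_labels(list(fit.params.index))) should then show the shorter labels, since the list has one entry per coefficient.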
Upvotes: 3
Reputation: 46888
You can use pd.Categorical to set the order of the levels; the first level becomes the reference. For example, with the dataset below and reference = 'B':
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
np.random.seed(222)
data = pd.DataFrame({
    'variable': np.random.randint(0, 2, 100),
    'type': np.random.choice(['A', 'B', 'C'], 100),
})
fit = sm.Logit.from_formula("variable ~ C(type,Treatment(reference= 'B'))",data=data).fit()
fit.summary()
                                             coef  std err        z    P>|z|   [0.025   0.975]
Intercept                                  0.1542    0.393    0.392    0.695   -0.617    0.925
C(type, Treatment(reference='B'))[T.A]    -0.4261    0.515   -0.828    0.408   -1.435    0.583
C(type, Treatment(reference='B'))[T.C]     0.2288    0.517    0.443    0.658   -0.784    1.241
And if you use pd.Categorical, you get the same result:
data['type'] = pd.Categorical(data['type'],categories=['B','A','C'],ordered=True)
fit = sm.Logit.from_formula("variable ~ type",data=data).fit()
fit.summary()
               coef  std err        z    P>|z|   [0.025   0.975]
Intercept    0.1542    0.393    0.392    0.695   -0.617    0.925
type[T.A]   -0.4261    0.515   -0.828    0.408   -1.435    0.583
type[T.C]    0.2288    0.517    0.443    0.658   -0.784    1.241
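The category order does not have to be typed out by hand; it can be built so that the chosen reference comes first. A small sketch, where reference_first is a hypothetical helper name and the column is assumed to have no missing values:

```python
def reference_first(values, reference):
    """Return the sorted unique levels of `values` with `reference` moved to the front."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError(f"{reference!r} is not one of the levels {levels}")
    # The first category is the one treated as the baseline.
    return [reference] + [level for level in levels if level != reference]


print(reference_first(['A', 'B', 'C', 'A', 'C'], 'B'))
```

With the data frame above, this could be used as data['type'] = pd.Categorical(data['type'], categories=reference_first(data['type'], 'B')).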
Upvotes: 0