into_r

Reputation: 395

Statsmodels : change variable name/label in the output

When I fit a Statsmodels model through the R-style formula interface and set the reference level of a categorical variable, the variable name in the output becomes very long. For instance:

import statsmodels.api as sm
import statsmodels.formula.api as smf
# data: a DataFrame with columns y and my_variable
sm.Logit.from_formula("y ~ C(my_variable, Treatment(reference='reference_level'))", data=data).fit().summary()

will output

C(my_variable, Treatment(reference='reference_level'))[T.some_value]

as the variable name in the model summary.

How can I rename this output label to something more readable without changing the variable name?

Thanks.

Upvotes: 3

Views: 1673

Answers (2)

Oaty

Reputation: 351

Under the hood, the formula api uses Patsy to apply your formula string to your data. The long names come from this process.

A quick fix: summary() accepts an optional xname argument, a list of labels applied to the rows of the summary's coefficient table. xname must have the same length as the params attribute of the results object (the object returned by fit()).

fit = sm.Logit.from_formula("variable ~ C(type,Treatment(reference= 'B'))",data=data).fit()
# one label per entry in fit.params, in the same order
fit.summary(xname=["Intercept", "type: A", "type: C"])

Take a look at the Summary object for more customizations.
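
If typing the labels by hand is tedious, they can be derived from the long names instead. The following is only a minimal sketch: shorten_names is a hypothetical helper (not part of Statsmodels or Patsy), and its regular expression assumes names of exactly the form shown in the question, i.e. C(var, Treatment(reference='ref'))[T.level]. Any name that does not match is returned unchanged.

```python
import re

# Assumed name format: C(var, Treatment(reference='ref'))[T.level]
_TREATMENT_TERM = re.compile(
    r"^C\(\s*(?P<var>\w+)\s*,\s*Treatment\(\s*reference\s*=\s*'(?P<ref>[^']*)'\s*\)\s*\)"
    r"\[T\.(?P<level>.+)\]$"
)


def shorten_names(names):
    """Return one short label per name; unrecognised names are kept as-is."""
    labels = []
    for name in names:
        match = _TREATMENT_TERM.match(name)
        if match:
            # e.g. "type[A vs B]": level A compared with reference B
            labels.append(f"{match['var']}[{match['level']} vs {match['ref']}]")
        else:
            labels.append(name)
    return labels


names = [
    "Intercept",
    "C(type, Treatment(reference='B'))[T.A]",
    "C(type, Treatment(reference='B'))[T.C]",
]
print(shorten_names(names))
# ['Intercept', 'type[A vs B]', 'type[C vs B]']
```

With a fitted model, the long names are available as fit.params.index, so the call would look like fit.summary(xname=shorten_names(list(fit.params.index))). Because the helper returns exactly one label per input name, the length requirement on xname is met automatically.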

Upvotes: 3

StupidWolf

Reputation: 46888

You can use pd.Categorical to set the order of the levels; the first category is used as the reference. For example, with the dataset below and 'B' as the reference:

import statsmodels.api as sm
import statsmodels.formula.api as smf 
import pandas as pd
import numpy as np

np.random.seed(222)
data  = pd.DataFrame({'variable':np.random.randint(0,2,100),
                      'type': np.random.choice(['A','B','C'],100)
                     })

fit = sm.Logit.from_formula("variable ~ C(type,Treatment(reference= 'B'))",data=data).fit()
fit.summary()

                                           coef  std err       z  P>|z|  [0.025  0.975]
Intercept                                0.1542    0.393   0.392  0.695  -0.617   0.925
C(type, Treatment(reference='B'))[T.A]  -0.4261    0.515  -0.828  0.408  -1.435   0.583
C(type, Treatment(reference='B'))[T.C]   0.2288    0.517   0.443  0.658  -0.784   1.241

And if you use pd.Categorical, you get the same result:

data['type'] = pd.Categorical(data['type'],categories=['B','A','C'],ordered=True)
fit = sm.Logit.from_formula("variable ~ type",data=data).fit()
fit.summary()

              coef  std err       z  P>|z|  [0.025  0.975]
Intercept   0.1542    0.393   0.392  0.695  -0.617   0.925
type[T.A]  -0.4261    0.515  -0.828  0.408  -1.435   0.583
type[T.C]   0.2288    0.517   0.443  0.658  -0.784   1.241

Upvotes: 0
