Reputation: 48476
I understand that when I have a categorical variable in a model passed to a statsmodels fit, dummy variables will automatically be generated for the categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand', 'China' and 'Mars', I will get variables in my model of the form
Location[T.Thailand]
with one of the values not represented. By default the excluded value seems to be the least common one. Is there a way to specify, ideally within the model specification, which value is treated as the "base value" and excluded?
Upvotes: 25
Views: 20667
Reputation: 39
OK, maybe someone will find this helpful. I needed to set a new baseline category for the dependent variable and had no idea how to do it. I searched and found nothing, so I simply added a "_" prefix to the other categories. If you have 3 categories A, B, C, and you want your baseline to be C, you just change the labels A and B to _A and _B. It works. It appears that the baseline category is determined by sorted(), i.e. the label that sorts first becomes the baseline.
Maybe someone knows a proper way to do it; this is not very Pythonic, ha.
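A quick way to see why the renaming trick works, and where it can fail, is to check the sort order directly. This is only a sketch and assumes, as observed above, that the first label in sorted() order becomes the baseline; the label lists are made-up examples.

```python
# Assumption (from the observation above): the baseline is the label
# that comes first when the labels are sorted as strings.
original = ["A", "B", "C"]
renamed = ["_A", "_B", "C"]        # workaround: prefix every non-baseline label
lowercase = ["_a", "_b", "c"]      # same trick with lowercase labels

first_original = sorted(original)[0]
first_renamed = sorted(renamed)[0]
first_lowercase = sorted(lowercase)[0]

print(first_original)   # 'A': the default baseline
print(first_renamed)    # 'C': "_" sorts after uppercase letters
print(first_lowercase)  # '_a': "_" sorts before lowercase letters, so the trick fails
```

So the underscore prefix only moves labels to the end when the labels start with uppercase letters (or digits); with lowercase labels it has the opposite effect.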
Upvotes: 3
Reputation: 217
If you use single quotes to wrap your formula string, the value of the reference argument needs to be wrapped in double quotes (or the other way round). It is a very easy mistake to make; I was using single quotes for both.
For example:
'y ~ C(Location, Treatment(reference="China"))'
is correct.
'y ~ C(Location, Treatment(reference='China'))'
is not correct: the inner single quotes end the Python string early, so the line is a syntax error.
Upvotes: 4
Reputation: 8283
You can pass a reference argument to the Treatment contrast, using syntax like
"y ~ C(Location, Treatment(reference='China'))"
http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment
If you have a better suggestion for naming conventions please file an issue with patsy.
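To check which level gets dropped, you can build the design matrix with patsy directly and look at the column names. This is a minimal sketch: the rows are made up, and 'Mars' is used as the reference only because 'China' already sorts first and would be the default baseline for this data.

```python
from patsy import dmatrix

# Toy data; the values mirror the question, the rows are invented.
data = {"Location": ["IndianOcean", "Thailand", "China", "Mars", "China", "Thailand"]}

# Default treatment coding: the first level in sorted order is the baseline.
default_cols = dmatrix("C(Location)", data).design_info.column_names

# Explicit baseline via the reference argument of Treatment.
mars_cols = dmatrix(
    "C(Location, Treatment(reference='Mars'))", data
).design_info.column_names

print(default_cols)  # no [T.China] column: China is the baseline
print(mars_cols)     # no [T.Mars] column: Mars is the baseline
```

The same right-hand side can be used in a statsmodels formula, for example "y ~ C(Location, Treatment(reference='Mars'))", assuming your data has a y column.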
Upvotes: 42