Reputation: 119
I'm having trouble understanding effect coding with glm. As an example:
data('mpg')
mpg$trans = as.factor(mpg$trans)
levels(mpg$trans)
[1] "auto(av)" "auto(l3)" "auto(l4)" "auto(l5)" "auto(l6)" "auto(s4)" "auto(s5)" "auto(s6)" "manual(m5)" "manual(m6)"
For effect (or dummy) coding, glm takes the first level as reference level, so in this case it will be "auto(av)".
mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.4173 0.7318 33.366 < 2e-16 ***
trans1 3.3827 2.3592 1.434 0.153017
trans2 2.5827 3.6210 0.713 0.476426
trans3 -2.4534 0.9157 -2.679 0.007928 **
trans4 -3.6993 1.0865 -3.405 0.000784 ***
trans5 -4.4173 2.1743 -2.032 0.043375 *
trans6 1.2494 2.9866 0.418 0.676105
trans7 0.9160 2.9866 0.307 0.759341
trans8 0.7702 1.4517 0.531 0.596262
trans9 1.8758 0.9845 1.905 0.058011 .
So I'm now thinking that trans1 actually is the second level ("auto(l3)"), because the first one is the reference level. To test this I relevel the factor, so that I will see the coefficient for the first level ("auto(av)"):
mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
levels(mpg$trans)
"auto(l3)" "auto(av)" "auto(l4)" "auto(l5)" "auto(l6)" "auto(s4)" "auto(s5)" "auto(s6)" "manual(m5)" "manual(m6)"
Now I'm expecting to see the coefficient of the first level and the coefficient of the second level is gone (because that is now the reference level):
mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.4173 0.7318 33.366 < 2e-16 ***
trans1 2.5827 3.6210 0.713 0.476426
trans2 3.3827 2.3592 1.434 0.153017
trans3 -2.4534 0.9157 -2.679 0.007928 **
trans4 -3.6993 1.0865 -3.405 0.000784 ***
trans5 -4.4173 2.1743 -2.032 0.043375 *
trans6 1.2494 2.9866 0.418 0.676105
trans7 0.9160 2.9866 0.307 0.759341
trans8 0.7702 1.4517 0.531 0.596262
trans9 1.8758 0.9845 1.905 0.058011 .
I see that the only thing that is changed, is that the first 2 coefficients are switched, so which level is taken as reference?? what am I missing here?
Upvotes: 2
Views: 613
Reputation: 46968
You are using contr.sum
where all levels are compared to the last level, and with the added constraint that all the cofficients (except the intercept) sum up to zero.
You can check it inside mpg_glm:
mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
mpg_glm$contrasts
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(av) 1 0 0 0 0 0 0 0
auto(l3) 0 1 0 0 0 0 0 0
auto(l4) 0 0 1 0 0 0 0 0
auto(l5) 0 0 0 1 0 0 0 0
auto(l6) 0 0 0 0 1 0 0 0
auto(s4) 0 0 0 0 0 1 0 0
auto(s5) 0 0 0 0 0 0 1 0
auto(s6) 0 0 0 0 0 0 0 1
manual(m5) 0 0 0 0 0 0 0 0
manual(m6) -1 -1 -1 -1 -1 -1 -1 -1
[,9]
auto(av) 0
auto(l3) 0
auto(l4) 0
auto(l5) 0
auto(l6) 0
auto(s4) 0
auto(s5) 0
auto(s6) 0
manual(m5) 1
manual(m6) -1
You have 9 columns here which are the 9 non-intercept coefficients in your model. They are more or less the means of each level minus the last level, because of the constraint to sum to zero. The last level is redundant and not shown here:
var_means = with(mpg,tapply(hwy,trans,mean))
cbind(delta_mean = var_means[-length(var_means)]-var_means[length(var_means)],
coef = coef(mpg_glm)[-1])
delta_mean coef
auto(av) 3.5894737 3.3827066
auto(l3) 2.7894737 2.5827066
auto(l4) -2.2466709 -2.4534380
auto(l5) -3.4925776 -3.6993447
auto(l6) -4.2105263 -4.4172934
auto(s4) 1.4561404 1.2493733
auto(s5) 1.1228070 0.9160399
auto(s6) 0.9769737 0.7702066
manual(m5) 2.0825771 1.8758101
Hence when you change the first level, you only change the order in which they appear, but the values don't change. You can easily check the contrast:
mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
new_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
new_glm$contrasts
$trans
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(l3) 1 0 0 0 0 0 0 0
auto(av) 0 1 0 0 0 0 0 0
auto(l4) 0 0 1 0 0 0 0 0
auto(l5) 0 0 0 1 0 0 0 0
auto(l6) 0 0 0 0 1 0 0 0
auto(s4) 0 0 0 0 0 1 0 0
auto(s5) 0 0 0 0 0 0 1 0
auto(s6) 0 0 0 0 0 0 0 1
manual(m5) 0 0 0 0 0 0 0 0
manual(m6) -1 -1 -1 -1 -1 -1 -1 -1
[,9]
auto(l3) 0
auto(av) 0
auto(l4) 0
auto(l5) 0
auto(l6) 0
auto(s4) 0
auto(s5) 0
auto(s6) 0
manual(m5) 1
manual(m6) -1
For what you describe, as changing the reference and having other levels relative this the ref, it should be contr.treatment
.
Upvotes: 2