Reputation: 11793
I'm a little confused about how to interpret the coefficients in a multiple regression with two categorical variables, using the mtcars dataset as an example. According to some online sources and books, the coefficient of a categorical variable is the difference in means between that level and the reference level, given that the other variable is at its reference level. In this example, according to the aggregated result below, the coefficient of factor(vs)1 should then be 81.8 - 91 = -9.2, but it's not; it's -13.92. Those claims seem to be wrong.
Can someone clarify this? How should I interpret the coefficients in terms of a 'mean difference'?
df <- mtcars  # use mtcars as the example data
fit <- lm(hp ~ factor(vs) + factor(cyl), data = df)
Call:
lm(formula = hp ~ factor(vs) + factor(cyl), data = df)
Coefficients:
 (Intercept)   factor(vs)1  factor(cyl)6  factor(cyl)8
       95.29        -13.92         34.95        113.93
# then the mean of hp at different levels of vs and cyl
aggregate(hp~vs+cyl, df, mean)
  vs cyl       hp
1  0   4  91.0000
2  1   4  81.8000
3  0   6 131.6667
4  1   6 115.2500
5  0   8 209.2143
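For reference, the raw difference I have in mind can be computed directly from the 4-cylinder cars alone (just a quick check with subset() and tapply(); any equivalent call would do):
with(subset(df, cyl == 4), diff(tapply(hp, vs, mean)))   # -9.2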
My second question is: what if I treat those categorical variables as ordered factors? There will then be linear and quadratic terms for those factors, but how should I interpret the coefficients?
lm(data=df, hp~factor(vs, ordered=TRUE)+factor(cyl, ordered=TRUE))
Call:
lm(formula = hp ~ factor(vs, ordered = TRUE) + factor(cyl, ordered = TRUE),
data = df)
Coefficients:
                  (Intercept)   factor(vs, ordered = TRUE).L
                       137.96                          -9.84
factor(cyl, ordered = TRUE).L  factor(cyl, ordered = TRUE).Q
                        80.56                          17.97
Thank you very much in advance.
Upvotes: 0
Views: 709
Reputation: 269654
Regarding the first question: if cyl is at its reference level and vs is at the 1 level, then the mean they are referring to is 95.29 - 13.92 + 0, and when vs and cyl are both at the reference level the mean is 95.29 + 0 + 0, so -13.92 is the difference between those two means. Because the model is additive (it has no vs:cyl interaction term), these model-based means need not reproduce the raw cell means, which is why -13.92 differs from the 81.8 - 91 = -9.2 you get from the aggregated table.
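A quick numerical check (assuming fit is the model from the question): compare the predictions at vs = 0 and vs = 1 while holding cyl at its reference level, 4; their difference is exactly the factor(vs)1 coefficient.
preds <- predict(fit, newdata = data.frame(vs = c(0, 1), cyl = 4))
diff(preds)   # -13.92, i.e. coef(fit)["factor(vs)1"]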
By 'mean' they are referring to the expected value of y, which is estimated by the predicted value. If we write the regression equation as y = terms + residuals, then the expected value of y equals the terms, i.e.
E(y) = E(terms + residuals)
     = E(terms) + E(residuals)
     = terms + 0        <- because terms is not random and the residuals have mean 0
     = terms
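Both parts of that argument can be checked directly on the fitted model (again assuming fit from the question):
mean(residuals(fit))   # essentially 0, up to floating-point error
head(fitted(fit))      # the 'terms' part: the model-based mean for each car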
Regarding the second question, which asks about ordered factors: they are rarely used, and I would ignore their existence for linear models. In the book Introductory Statistics with R, Peter Dalgaard mentions that the implementation in R assumes that the levels are equidistant, and such an assumption is questionable in general.
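If you do want to see what the .L and .Q terms are, they come from orthogonal polynomial contrasts, and switching to ordered factors only changes the coding, not the fitted model. A small sketch (assuming fit is the unordered model from the question):
contr.poly(3)   # the linear (.L) and quadratic (.Q) contrasts for a 3-level factor
fit_ord <- lm(hp ~ factor(vs, ordered = TRUE) + factor(cyl, ordered = TRUE),
              data = mtcars)
all.equal(fitted(fit), fitted(fit_ord))   # TRUE: same fit, different parametrization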
Upvotes: 0