Reputation: 81
I have a linear model of class "lm" that I am viewing with summary(lm), a toy version of which is:
fit <- lm(Strength ~ Age + Sex, data = mydata)
summary(fit)
Understandably, Age is a continuous variable while Sex is a categorical variable. The relevant part of the summary(fit) output looks like:
             Estimate
(Intercept) -1.838e-01
Age         -5.264e-03
Sex.L        3.260e-01
How should I interpret this, specifically the categorical variable? I understand this to mean:
Strength = -0.1838 + (-0.005264 * Age) + (0.326 * Sex)
but is this correct, and what value would Sex take? 1 for one sex, and 0 for the other? And how should I check which sex takes the value 1? Since my factor levels for Sex are Male and Female, I assume .L is a dummy variable for one of them, but I don't know how to check this.
Any advice would be much appreciated.
Thank you very much.
Reputation: 73285
The coefficient is named "Sex.L". This implies that Sex is an ordered categorical variable (an ordered factor), so polynomial contrast encoding was used instead of treatment encoding. In this case, Sex in your equation does not simply take the values 0 and 1.
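To see which values Sex does take under polynomial coding, one can inspect the contrast matrix. The following is a minimal sketch; the two-level ordered factor sex is a made-up stand-in for the OP's column, and R's default contrast options are assumed:

```r
# Hypothetical stand-in for the OP's column: a two-level ordered factor
sex <- factor(c("Female", "Male"), levels = c("Female", "Male"), ordered = TRUE)

# Contrast matrix that lm() would use for this factor
# (assumes the default options("contrasts"), i.e. contr.poly for ordered factors)
cm <- contrasts(sex)
cm

# With two levels, the single ".L" column is -1/sqrt(2) and 1/sqrt(2)
all.equal(as.vector(cm), c(-1, 1) / sqrt(2))
```

Under this coding the two levels are sqrt(2) apart, so the difference between the second and the first level is Sex.L * sqrt(2), which is roughly 0.326 * 1.414 = 0.461 for the output in the question.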
To get a 0/1 dummy variable, first convert this ordered factor to the usual (unordered) factor:
mydata$Sex <- factor(mydata$Sex, ordered = FALSE)
You can check levels(mydata$Sex) at this stage. The 1st level becomes the reference level and gets no coefficient of its own; the reported coefficient of Sex is for the 2nd level, relative to the 1st. Note that using a different contrast will result in different coefficients.
You can also control levels to be in your desired order, say:
mydata$Sex <- factor(mydata$Sex, levels = c("Male", "Female"), ordered = FALSE)
Note that changing the order of levels gives different regression coefficients, too.
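As an alternative way to pick the reference level, base R's relevel() moves one level to the front. This is only a sketch on a made-up vector sex, not the OP's data; note that relevel() works on unordered factors only, so an ordered factor must be converted first:

```r
# Made-up unordered factor; levels default to alphabetical order
sex <- factor(c("Female", "Male", "Male", "Female"))
levels(sex)

# Move "Male" to the front so it becomes the reference level
sex2 <- relevel(sex, ref = "Male")
levels(sex2)
```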
Anyway, as long as Sex is the usual factor (i.e., is.ordered(mydata$Sex) is FALSE), treatment contrast encoding (R's default for unordered factors) will be applied. The 1st level is coded as 0, while the 2nd level is coded as 1. Suppose the fitted model coefficients are a, b and c, then the equation will be:
Strength = a + b * Age + c * Sex
where Sex is 0 for the 1st level, and 1 for the 2nd level.
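The 0/1 coding can be seen directly in the design matrix. A small sketch on made-up data (the data frame toy is hypothetical and R's default contrast options are assumed):

```r
# Made-up toy data, only to display how the factor is coded
toy <- data.frame(
  Age = c(20, 30, 40, 50),
  Sex = factor(c("Female", "Male", "Female", "Male"))
)

levels(toy$Sex)  # the first level is the reference level

# Design matrix that lm(Strength ~ Age + Sex) would build from these columns
X <- model.matrix(~ Age + Sex, data = toy)
X
```

The dummy column is named after the non-reference level and is 1 exactly for the rows at that level.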
A bit of background:
The "L" in "Sex.L" means "Linear", which is an indication of polynomial contrast. If the factor has 4 levels instead, we will see "L" (Linear), "Q" (Quadratic) and "C" (Cubic).
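The suffixes for a 4-level factor can be checked with contr.poly() itself; a brief sketch:

```r
# Polynomial contrasts for a factor with 4 levels
cp <- contr.poly(4)

colnames(cp)   # suffixes appended to the variable name in the coefficient table
round(cp, 3)   # one orthonormal column per polynomial degree
```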
However, if Sex is the usual factor, the reported name will be "SexMale" or "SexFemale", which tells you directly which level the coefficient refers to.
If we see "SexMale", then "Female" is the 1st level, so in the equation, Sex is 0 for Female and 1 for Male.
If we see "SexFemale", then "Male" is the 1st level, so in the equation, Sex is 0 for Male and 1 for Female.
This naming convention for categorical variables after contrast encoding is very helpful.
A reproducible example
Since OP did not provide a reproducible example, I decided to simulate a dataset (where Sex is an ordered factor) to help readers follow what I said above.
mydata <- structure(list(Strength = c(-0.4484, -0.4584, -0.4765, -0.4676,
-0.4979, -0.507, -0.5094, -0.5071, -0.5046, -0.5346, -0.5302,
-0.5298, -0.5354, -0.5489, -0.5646, -0.5858, -0.5731, -0.5368,
-0.5418, -0.5521, -0.5967, -0.5826, -0.5751, -0.5914, -0.6069,
-0.5831, -0.6045, -0.6111, -0.618, -0.6375, -0.634, -0.6212,
-0.6496, -0.6387, -0.6387, -0.6695, -0.6413, -0.6499, -0.6763,
-0.6826, -0.6579, -0.7051, -0.6982, -0.7004, -0.7101, -0.6964,
-0.6958, -0.7583, -0.7247, -0.7117, -0.7328, -0.0037, -0.003,
-0.0095, 0.0137, -0.0228, -0.025, -0.0339, -0.041, -0.0271, -0.0303,
-0.0633, -0.0572, -0.0542, -0.0648, -0.087, -0.0983, -0.0625,
-0.0832, -0.0776, -0.1046, -0.1158, -0.1331, -0.1137, -0.1288,
-0.1366, -0.1538, -0.1346, -0.1348, -0.1698, -0.1726, -0.1798,
-0.1888, -0.1735, -0.1724, -0.183, -0.2001, -0.2029, -0.1812,
-0.2126, -0.2086, -0.2278, -0.2279, -0.2294, -0.208, -0.2575,
-0.258, -0.2356, -0.2417, -0.2406, -0.2683, -0.2914), Age = c(10L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L,
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L,
37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L,
50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 10L, 11L,
12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,
25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,
38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,
51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L), Sex = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), levels = c("Female", "Male"), class = c("ordered",
"factor"))), row.names = c(NA, -102L), class = "data.frame")
The kind of model that OP got:
fit1 <- lm(Strength ~ Age + Sex, data = mydata)
fit1  # printing the fit shows these coefficients:
#(Intercept)          Age        Sex.L
#  -0.178985    -0.005426     0.329874
is.ordered(mydata$Sex)
#[1] TRUE
Convert Sex to the usual factor:
mydata$Sex <- factor(mydata$Sex, ordered = FALSE)
is.ordered(mydata$Sex)
#[1] FALSE
levels(mydata$Sex)
#[1] "Female" "Male"
fit2 <- lm(Strength ~ Age + Sex, data = mydata)
fit2  # printing the fit shows these coefficients:
#(Intercept)          Age      SexMale
#  -0.412241    -0.005426     0.466512
Control order of levels:
mydata$Sex <- factor(mydata$Sex, levels = c("Male", "Female"), ordered = FALSE)
is.ordered(mydata$Sex)
#[1] FALSE
levels(mydata$Sex)
#[1] "Male" "Female"
fit3 <- lm(Strength ~ Age + Sex, data = mydata)
fit3  # printing the fit shows these coefficients:
#(Intercept)          Age    SexFemale
#   0.054270    -0.005426    -0.466512
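The encodings above describe the same fitted model; only the parameterization changes. A sketch that checks this on freshly simulated data (the data frame sim and its generating coefficients are made up for illustration, not taken from the OP or from mydata):

```r
set.seed(1)  # made-up data, unrelated to the OP's

n <- 50
sim <- data.frame(
  Age = rep(seq(10, 59), times = 2),
  Sex = factor(rep(c("Female", "Male"), each = n),
               levels = c("Female", "Male"), ordered = TRUE)
)
sim$Strength <- -0.4 - 0.005 * sim$Age + 0.45 * (sim$Sex == "Male") +
  rnorm(2 * n, sd = 0.01)

# Ordered factor: polynomial contrast, coefficient named "Sex.L"
fit_poly <- lm(Strength ~ Age + Sex, data = sim)

# Unordered factor: treatment contrast, coefficient named "SexMale"
sim$Sex <- factor(sim$Sex, ordered = FALSE)
fit_trt <- lm(Strength ~ Age + Sex, data = sim)

# The two levels are sqrt(2) apart under the polynomial coding,
# so Sex.L * sqrt(2) equals the Male-minus-Female difference
unname(coef(fit_poly)["Sex.L"]) * sqrt(2)
unname(coef(fit_trt)["SexMale"])

# Fitted values are identical under both encodings
all.equal(fitted(fit_poly), fitted(fit_trt))
```

The same relation holds for the example above: 0.329874 * sqrt(2) is approximately 0.466512.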