Reputation: 81

Interpreting Estimate of categorical variable coefficient in lm() summary() in R

I have a linear model of class "lm" that I am inspecting with summary(). A toy version is:

fit <- lm(Strength ~ Age + Sex, data = mydata)

summary(fit)

Here Age is a continuous variable and Sex is a categorical variable. The relevant part of the summary(fit) output looks like:

             Estimate 
(Intercept)  -1.838e-01
Age          -5.264e-03
Sex.L        3.260e-01

How should I interpret this, specifically the categorical variable? I understand this to mean:

Strength = -0.1838 + (-0.005264 * Age) + (0.326 * Sex)

but is this correct, and what value would Sex take? 1 for one sex, and 0 for the other? And how should I check which sex takes the value 1? Since my factor levels for Sex are Male and Female, I assume .L is a dummy variable for one of them, but I don't know how to check this.

Any advice would be much appreciated.

Thank you very much.

Upvotes: 0

Answers (1)

Zheyuan Li

Reputation: 73285

The name for the coefficient is "Sex.L"! This implies that Sex is an ordered categorical variable, and polynomial contrast encoding instead of treatment encoding was used. In this case, Sex in your equation does not simply take 0 or 1.
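To see which values Sex actually takes under polynomial contrasts, one can inspect the contrast matrix. This is a minimal sketch on a hypothetical two-level ordered factor named `sex` (not the OP's data), assuming R's default contrast options:

```r
# Hypothetical two-level ordered factor, for illustration only
sex <- factor(c("Female", "Male"),
              levels = c("Female", "Male"), ordered = TRUE)

# Contrast matrix lm() would use for this factor under default options
cm <- contrasts(sex)
cm
# A single column named ".L": -1/sqrt(2) for the 1st level, 1/sqrt(2) for the 2nd
```

So with two ordered levels, Sex enters the equation as roughly -0.707 or 0.707 rather than as 0 or 1.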

You should first convert this ordered factor to an ordinary (unordered) factor:

mydata$Sex <- factor(mydata$Sex, ordered = FALSE)

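You can check levels(mydata$Sex) at this stage. The 1st level is the reference (baseline) level and gets no coefficient of its own; the reported Sex coefficient is the difference between the 2nd level and the 1st at any given Age. Note that using a different contrast will result in different coefficients.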

You can also control levels to be in your desired order, say:

mydata$Sex <- factor(mydata$Sex, levels = c("Male", "Female"), ordered = FALSE)

Note that changing the order of levels gives different regression coefficients, too.
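As an alternative way to pick the reference level, base R's relevel() moves a chosen level to the front. A small sketch on a hypothetical unordered factor `sex` (relevel() is only meant for unordered factors, so an ordered factor has to be converted first):

```r
# Hypothetical unordered factor; levels default to alphabetical order
sex <- factor(c("Female", "Male", "Male", "Female"))
levels(sex)

# Make "Male" the 1st (reference) level
sex2 <- relevel(sex, ref = "Male")
levels(sex2)
```

After this, a model using `sex2` would report a coefficient for Female relative to Male.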

Anyway, as long as Sex is an unordered factor (i.e., is.ordered(mydata$Sex) is FALSE), treatment contrast encoding is applied under R's default options. The 1st level is coded as 0, while the 2nd level is coded as 1. Suppose the fitted model coefficients are a, b and c, then the equation will be:

Strength = a + b * Age + c * Sex

where Sex is 0 for the 1st level, and 1 for the 2nd level.
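One way to check the 0/1 coding directly is to look at the design matrix that lm() builds. A minimal sketch with a hypothetical four-row data frame `toy` (not the OP's data), assuming default contrast options:

```r
# Hypothetical toy data; Sex is an unordered factor with levels Female, Male
toy <- data.frame(
  Age = c(20, 30, 40, 50),
  Sex = factor(c("Female", "Male", "Female", "Male"))
)

# Design matrix for the formula Age + Sex
X <- model.matrix(~ Age + Sex, data = toy)
X
# The "SexMale" column is 0 for Female rows and 1 for Male rows
```

The same call works on a fitted model, e.g. model.matrix(fit), if you want to inspect the encoding actually used in a fit.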

A bit of background:

The "L" in "Sex.L" means "Linear", which is an indication of polynomial contrast. If the factor has 4 levels instead, we will see "L" (Linear), "Q" (Quadratic) and "C" (Cubic).
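The naming for a 4-level ordered factor can be checked without fitting anything. A short sketch using a hypothetical ordered factor `dose`:

```r
# Hypothetical 4-level ordered factor, for illustration only
dose <- factor(c("low", "mid", "high", "max"),
               levels = c("low", "mid", "high", "max"), ordered = TRUE)

# Column names of the polynomial contrast matrix for 4 levels
poly_names <- colnames(contr.poly(4))
poly_names

# Column names lm() would report as coefficient names
coef_names <- colnames(model.matrix(~ dose))
coef_names
```

The coefficient names are the factor name followed by .L, .Q and .C.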

However, if Sex is the usual factor, the reported name should be "SexMale" or "SexFemale". Yes, this is informative enough.

If we see "SexMale", then "Female" is the 1st level, so in the equation, Sex is 0 for Female and 1 for Male.
If we see "SexFemale", then "Male" is the 1st level, so in the equation, Sex is 0 for Male and 1 for Female.

This naming convention for categorical variables after contrast encoding is very helpful.

A reproducible example

Since OP did not provide a reproducible example, I decided to simulate a dataset (where Sex is an ordered factor) to help readers follow what I said above.

mydata <- structure(list(Strength = c(-0.4484, -0.4584, -0.4765, -0.4676, 
-0.4979, -0.507, -0.5094, -0.5071, -0.5046, -0.5346, -0.5302, 
-0.5298, -0.5354, -0.5489, -0.5646, -0.5858, -0.5731, -0.5368, 
-0.5418, -0.5521, -0.5967, -0.5826, -0.5751, -0.5914, -0.6069, 
-0.5831, -0.6045, -0.6111, -0.618, -0.6375, -0.634, -0.6212, 
-0.6496, -0.6387, -0.6387, -0.6695, -0.6413, -0.6499, -0.6763, 
-0.6826, -0.6579, -0.7051, -0.6982, -0.7004, -0.7101, -0.6964, 
-0.6958, -0.7583, -0.7247, -0.7117, -0.7328, -0.0037, -0.003, 
-0.0095, 0.0137, -0.0228, -0.025, -0.0339, -0.041, -0.0271, -0.0303, 
-0.0633, -0.0572, -0.0542, -0.0648, -0.087, -0.0983, -0.0625, 
-0.0832, -0.0776, -0.1046, -0.1158, -0.1331, -0.1137, -0.1288, 
-0.1366, -0.1538, -0.1346, -0.1348, -0.1698, -0.1726, -0.1798, 
-0.1888, -0.1735, -0.1724, -0.183, -0.2001, -0.2029, -0.1812, 
-0.2126, -0.2086, -0.2278, -0.2279, -0.2294, -0.208, -0.2575, 
-0.258, -0.2356, -0.2417, -0.2406, -0.2683, -0.2914), Age = c(10L, 
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 
37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 
50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 
25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 
38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 
51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L), Sex = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), levels = c("Female", "Male"), class = c("ordered", 
"factor"))), row.names = c(NA, -102L), class = "data.frame")

The kind of model the OP fitted:

fit1 <- lm(Strength ~ Age + Sex, data = mydata)
#(Intercept)          Age        Sex.L  
#  -0.178985    -0.005426     0.329874  

is.ordered(mydata$Sex)
#[1] TRUE

Convert Sex to the usual factor:

mydata$Sex <- factor(mydata$Sex, ordered = FALSE)

is.ordered(mydata$Sex)
#[1] FALSE

levels(mydata$Sex)
#[1] "Female" "Male"  

fit2 <- lm(Strength ~ Age + Sex, data = mydata)
#(Intercept)          Age      SexMale  
#  -0.412241    -0.005426     0.466512

Control order of levels:

mydata$Sex <- factor(mydata$Sex, levels = c("Male", "Female"), ordered = FALSE)

is.ordered(mydata$Sex)
#[1] FALSE

levels(mydata$Sex)
#[1] "Male"   "Female"

fit3 <- lm(Strength ~ Age + Sex, data = mydata)
#(Intercept)          Age    SexFemale  
#   0.054270    -0.005426    -0.466512
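The three fits above are the same model under different encodings. For a two-level factor the polynomial codes are -1/sqrt(2) and 1/sqrt(2), so the treatment coefficient should equal sqrt(2) times the ".L" coefficient (here 0.329874 * sqrt(2) is about 0.466512, matching SexMale up to rounding). The sketch below checks this on a separately simulated, hypothetical data frame `toy`, so that it runs on its own; the coefficients used to generate it are made up:

```r
set.seed(1)

# Hypothetical simulated data with Sex as an ordered factor
toy <- data.frame(
  Age = rep(10:60, 2),
  Sex = factor(rep(c("Female", "Male"), each = 51),
               levels = c("Female", "Male"), ordered = TRUE)
)
toy$Strength <- -0.4 - 0.005 * toy$Age +
  0.45 * (toy$Sex == "Male") + rnorm(nrow(toy), sd = 0.01)

# Fit with polynomial contrasts (ordered factor)
fit_poly <- lm(Strength ~ Age + Sex, data = toy)

# Convert to an unordered factor and refit with treatment contrasts
toy$Sex <- factor(toy$Sex, ordered = FALSE)
fit_trt <- lm(Strength ~ Age + Sex, data = toy)

# Treatment coefficient vs. sqrt(2) times the linear-contrast coefficient
c_trt  <- unname(coef(fit_trt)["SexMale"])
c_poly <- unname(coef(fit_poly)["Sex.L"])
all.equal(c_trt, sqrt(2) * c_poly)

# Fitted values agree, since only the parameterisation differs
all.equal(unname(fitted(fit_poly)), unname(fitted(fit_trt)))
```

The Age coefficient is also unchanged across the two fits, as in the output above.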


Upvotes: 4
