Luis

Reputation: 1584

The intercept of a categorical multiple regression in R is not the mean value?

Let's say I have two categorical variables and one continuous one:

library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),  # recycled to fill 90 rows
  sex_dummy = c(0, 1)
)

When I regress depression on sex (0 or 1), the intercept is 10.0436, which is the mean for sex = 0. OK!

ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 2 x 2
  sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
      <dbl> <chr>                                                
1      0    10.0436                                              
2      1.00 10.1640

The same thing happens when I regress depression on schooling: the intercept is 10.4398, which is the mean for schooling = 0.

ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 3 x 2
  schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
            <dbl> <chr>                                                
1            0    10.4398                                              
2            1.00 9.7122                                               
3            2.00 10.1593    

Now, when I fit a model with both variables, why is the intercept not the mean of the group where both are 0? The regression intercept is 10.3796, but the mean when sex = 0 and schooling = 0 is 10.32548:

ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
  schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
            <dbl>     <dbl> <chr>                                                
1            0         0    10.32548                                             
2            0         1.00 10.55404                                             
3            1.00      0    9.59305                                              
4            1.00      1.00 9.83139                                              
5            2.00      0    10.21218                                             
6            2.00      1.00 10.10648   

When I predict from the model with both set to 0:

predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
       1 
10.37956 

This result is related to depression (of course...) but still not what I was expecting, given: https://www.theanalysisfactor.com/interpret-the-intercept/

A previous forum post says the same thing. I am aware that my variables are categorical and I have adjusted my script accordingly; you can reproduce everything with the code below. Thanks!

library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),  # recycled to fill 90 rows
  sex_dummy = c(0, 1)
)

# depression on sex only
mod <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)
ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))

# depression on schooling only
mod2 <- lm(data = ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)
ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))

# depression on both (additive, no interaction)
mod3 <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
             relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))

predict(mod3, data.frame(sex_dummy = 0, schooling_dummy = 0))

Upvotes: 0

Views: 1100

Answers (1)

Chuck P

Reputation: 3923

There are two errors in your thinking (although your R code works, so it's not a programming error).

First and foremost, you violated your own premise: schooling is not dummy-coded. A dummy variable takes only zeroes and ones, but schooling takes the values 0, 1, and 2.

Second, you left out the interaction effect in your lm model. Without it, the fitted value at sex = 0 and schooling = 0 is the additive model's least-squares compromise across all six cells, not the raw mean of that one cell.
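You can see that constraint directly by comparing the additive model's fitted values to the raw cell means (a minimal sketch, reusing ds and mod3 from the question; grid and cell_means are just illustrative names):

# fitted values from the additive (no-interaction) model for every cell
grid <- expand.grid(sex_dummy = c(0, 1), schooling_dummy = c(0, 1, 2))
grid$fitted <- predict(mod3, grid)
# raw cell means for the same six cells
cell_means <- aggregate(depression ~ sex_dummy + schooling_dummy, data = ds, mean)
merge(grid, cell_means)
# fitted differs from depression in every row: the additive fit forces the
# sex effect to be identical across schooling levels, so no cell mean is
# matched exactly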

Try this...

library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)
# if you explicitly make these variables factors, not integers,
# R will do the right thing with them
ds$schooling_dummy <- factor(ds$schooling_dummy)
ds$sex_dummy <- factor(ds$sex_dummy)
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)

The results give you the mean(s) you were expecting...

Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)

Coefficients:
                (Intercept)             schooling_dummy1             schooling_dummy2  
                  10.325482                    -0.732433                    -0.113305  
                 sex_dummy1  schooling_dummy1:sex_dummy1  schooling_dummy2:sex_dummy1  
                   0.228561                     0.009778                    -0.334254  
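As a quick check (a minimal sketch; mod4 is just an illustrative name for a refit of the same interaction model on the factor-converted ds), the fitted value at the reference cell now reproduces the cell mean the question was looking for:

mod4 <- lm(depression ~ schooling_dummy * sex_dummy, data = ds)
# the predictors were fitted as factors, so pass the levels as strings
predict(mod4, data.frame(schooling_dummy = "0", sex_dummy = "0"))
#        1
# 10.32548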

And FWIW, you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with. So if your data is coded this way:

ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling = c("A", "B", "C"),
  sex = c("Male", "Female")
)

You're less likely to make the same mistake, plus the results are easier to read, as the sketch below shows.
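For instance, fitting the same interaction model on the character-coded data (a minimal sketch; mod_chr is just an illustrative name; lm() converts character columns to factors automatically, with reference levels chosen alphabetically):

mod_chr <- lm(depression ~ schooling * sex, data = ds)
coef(mod_chr)
# coefficient names now read like "schoolingB", "sexMale" and
# "schoolingB:sexMale" instead of anonymous 0/1/2 codes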

Upvotes: 2
