Luis

Reputation: 1584

The intercept of a categorical multiple regression in R is not the mean value?

Let's say I have two categorical variables and one continuous one:

library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),  # recycled to fill 90 rows
  sex_dummy = c(0, 1)
)

When I regress depression on sex (0 or 1), the intercept is 10.0436, which is the mean for sex = 0. OK!

ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 2 x 2
  sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
      <dbl> <chr>                                                
1      0    10.0436                                              
2      1.00 10.1640

The same thing happens when I regress depression on schooling: the intercept is 10.4398, which is the mean for schooling = 0.

ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 3 x 2
  schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
            <dbl> <chr>                                                
1            0    10.4398                                              
2            1.00 9.7122                                               
3            2.00 10.1593    

Now, when I fit a model with both variables, why is the intercept not the mean of the group where both are 0? The regression intercept is 10.3796, but the mean when sex = 0 and schooling = 0 is 10.32548:

ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
  schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
            <dbl>     <dbl> <chr>                                                
1            0         0    10.32548                                             
2            0         1.00 10.55404                                             
3            1.00      0    9.59305                                              
4            1.00      1.00 9.83139                                              
5            2.00      0    10.21218                                             
6            2.00      1.00 10.10648   

When I predict from the model with both set to 0:

predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
       1 
10.37956 

This result is related to depression (of course...) but still not what I was expecting, given: https://www.theanalysisfactor.com/interpret-the-intercept/

A previous forum post says the same thing. I am aware that my variables are categorical and I have adjusted my script accordingly; you can reproduce everything with the code below. Thanks!

library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),  # recycled to fill 90 rows
  sex_dummy = c(0, 1)
)

# depression on sex only
mod <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)
ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))

# depression on schooling only
mod2 <- lm(data = ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)
ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))

# depression on both (additive, no interaction)
mod3 <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
             relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))

predict(mod3, data.frame(sex_dummy = 0, schooling_dummy = 0))

Upvotes: 0

Views: 1100

Answers (1)

Chuck P

Reputation: 3923

There are two errors in your thinking (although your R code works, so it's not a programming error).

First and foremost, you violated your own premise: schooling is not dummy-coded. A dummy variable takes only zeroes and ones, but schooling takes the values 0, 1, and 2.

Second, you left out the interaction effect in your lm model. Without it, the fitted value at sex = 0 and schooling = 0 is the additive model's least-squares compromise across all six cells, not the raw mean of that one cell.
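You can see that constraint directly by comparing the additive model's fitted values to the raw cell means (a minimal sketch, reusing ds and mod3 from the question; grid and cell_means are just illustrative names):

# fitted values from the additive (no-interaction) model for every cell
grid <- expand.grid(sex_dummy = c(0, 1), schooling_dummy = c(0, 1, 2))
grid$fitted <- predict(mod3, grid)
# raw cell means for the same six cells
cell_means <- aggregate(depression ~ sex_dummy + schooling_dummy, data = ds, mean)
merge(grid, cell_means)
# fitted differs from depression in every row: the additive fit forces the
# sex effect to be identical across schooling levels, so no cell mean is
# matched exactly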

Try this...

library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)
# if you explicitly make these variables factors, not integers,
# R will do the right thing with them
ds$schooling_dummy <- factor(ds$schooling_dummy)
ds$sex_dummy <- factor(ds$sex_dummy)
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)

The results give you the mean(s) you were expecting...

Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)

Coefficients:
                (Intercept)             schooling_dummy1             schooling_dummy2  
                  10.325482                    -0.732433                    -0.113305  
                 sex_dummy1  schooling_dummy1:sex_dummy1  schooling_dummy2:sex_dummy1  
                   0.228561                     0.009778                    -0.334254  
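As a quick check (a minimal sketch; mod4 is just an illustrative name for a refit of the same interaction model on the factor-converted ds), the fitted value at the reference cell now reproduces the cell mean the question was looking for:

mod4 <- lm(depression ~ schooling_dummy * sex_dummy, data = ds)
# the predictors were fitted as factors, so pass the levels as strings
predict(mod4, data.frame(schooling_dummy = "0", sex_dummy = "0"))
#        1
# 10.32548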

And FWIW, you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with. So if your data is coded this way:

ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling = c("A", "B", "C"),
  sex = c("Male", "Female")
)

You're less likely to make the same mistake, plus the results are easier to read, as the sketch below shows.
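For instance, fitting the same interaction model on the character-coded data (a minimal sketch; mod_chr is just an illustrative name; lm() converts character columns to factors automatically, with reference levels chosen alphabetically):

mod_chr <- lm(depression ~ schooling * sex, data = ds)
coef(mod_chr)
# coefficient names now read like "schoolingB", "sexMale" and
# "schoolingB:sexMale" instead of anonymous 0/1/2 codes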

Upvotes: 2
