Reputation: 1584
Let's say I have 2 (categorical) variables and one continuous:
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
When I regress depression on sex (0 or 1), the intercept is 10.0436, what is the mean of sex = 0. Ok!
ds %>% group_by(sex_dummy) %>%
+ summarise(formatC(mean(depression),format="f", digits=4))
# A tibble: 2 x 2
sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.0436
2 1.00 10.1640
The same thing happens when I regress depression on schooling. The intercept value is 10.4398. The mean of schooling = 0 is the same.
ds %>% group_by(schooling_dummy) %>%
+ summarise(formatC(mean(depression),format="f", digits=4))
# A tibble: 3 x 2
schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.4398
2 1.00 9.7122
3 2.00 10.1593
Now, when I compute a model with both variables, why the intercept is not the mean when both groups = 0? The regression **intercept is 10.3796, but the mean when sex = 0, and schooling is = 0 is 10.32548:
ds %>% group_by(schooling_dummy,sex_dummy) %>%
+ summarise(formatC(mean(depression),format="f", digits=5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
<dbl> <dbl> <chr>
1 0 0 10.32548
2 0 1.00 10.55404
3 1.00 0 9.59305
4 1.00 1.00 9.83139
5 2.00 0 10.21218
6 2.00 1.00 10.10648
When I predict the model when both are 0:
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
1
10.37956
This result is related to depression (of course...) but still not What I was expecting, since:
(Reference: https://www.theanalysisfactor.com/interpret-the-intercept/)
What is the same for this previous forum post I aware of my variables are categorical and I'm adjusting my script, as you can reproduce using this code below: Thanks
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
mod <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)
ds %>% group_by(sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
mod2 <- lm(data=ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)
ds %>% group_by(schooling_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
mod3 <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
Upvotes: 0
Views: 1100
Reputation: 3923
Two errors in your thinking (although your R code works so it's not a programming error.
First and foremost you violated your own statement you have not dummy coded schooling
it does not have only zeroes and ones it has 0,1 & 2.
Second you forgot the interaction effect in your lm
modeling...
Try this...
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
# if you explicitly make these variables factors not integers R will do the right thing with them
ds$schooling_dummy<-factor(ds$schooling_dummy)
ds$sex_dummy<-factor(ds$sex_dummy)
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)
The results give you the mean(s) you were expecting...
Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)
Coefficients:
(Intercept) schooling_dummy1 schooling_dummy2
10.325482 -0.732433 -0.113305
sex_dummy1 schooling_dummy1:sex_dummy1 schooling_dummy2:sex_dummy1
0.228561 0.009778 -0.334254
and FWIW you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with... so if your data is coded this way:
ds <- data.frame(
depression=rnorm(90,10,2),
schooling=c("A","B","C"),
sex=c("Male","Female")
)
You're less likely to make the same mistake plus the results are easier to read...
Upvotes: 2