How to simulate for multiple regression analysis data with a dummy variable

Question

I would like to simulate data for a regression analysis, which involves a dummary variable. While a regression recovers the slopes, it does not recover the intercept:

beta <- c(2,3,4)
x1   <- rnorm(100,50,5)
x2   <- sample(c(0,1), replace=T,100)
eps  <- rnorm(100, 0, 5)

y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

summary(lm(y~x1 + x2))

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6598  -2.7433  -0.2873   2.4616  13.2250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.2858     5.3470  -1.550    0.124    
x1            3.2216     0.1070  30.109  < 2e-16 ***
x2            3.9209     0.9065   4.325  3.7e-05 ***

I know the dummy variable shifts the intercept up or down, but I'm confused about what adjustments to make to create a data set that can recover the intercept. Any suggestions are greatly appreciated, thank you.

bzki · Accepted Answer

You should expect that on average, you will recover the true coefficient values. But on a given simulated dataset, your coefficient estimates will deviate. I repeat your simulation setup 1000 times and then I take the average of the coefficient estimates.

beta <- c(2,3,4)
do_experiment <- function(n = 100, eps.sd = 5) {
  x1   <- rnorm(n, 50, 5)
  x2   <- sample(c(0,1), replace=T, n)
  eps  <- rnorm(n, 0, eps.sd)

  y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

  return(coef(lm(y~x1 + x2)))
}

set.seed(1212)
coefEstimates <- replicate(1000, do_experiment(n = 100))
rowMeans(coefEstimates)

(Intercept)          x1          x2 
   1.972531    3.000327    4.010408

apply(coefEstimates, 1, sd)

(Intercept)          x1          x2 
 5.05588111  0.09988136  1.00523822

You could decrease the variance of your error term if you want less variability in your intercept estimate from simulation to simulation. As @rookie mentions, you can also increase the sample size.

set.seed(1213)
coefEstimates2 <- replicate(1000, do_experiment(n = 100, eps.sd = 1))
rowMeans(coefEstimates2)

(Intercept)          x1          x2 
   2.046227    2.999186    3.996409 

apply(coefEstimates2, 1, sd)

(Intercept)          x1          x2 
  1.0459009   0.0205488   0.1995421

How to simulate for multiple regression analysis data with a dummy variable

Answers (2)

Related Questions