Kevin

Reputation: 37

How to simulate data for a multiple regression analysis with a dummy variable

I would like to simulate data for a regression analysis that involves a dummy variable. The regression recovers the slopes, but it does not recover the intercept:

beta <- c(2, 3, 4)                            # true coefficients: intercept, slope on x1, slope on x2
x1   <- rnorm(100, 50, 5)                     # continuous predictor
x2   <- sample(c(0, 1), 100, replace = TRUE)  # dummy (0/1) variable
eps  <- rnorm(100, 0, 5)                      # error term with sd = 5

y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

summary(lm(y ~ x1 + x2))

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6598  -2.7433  -0.2873   2.4616  13.2250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.2858     5.3470  -1.550    0.124    
x1            3.2216     0.1070  30.109  < 2e-16 ***
x2            3.9209     0.9065   4.325  3.7e-05 ***

I know the dummy variable shifts the intercept up or down, but I'm confused about what adjustments to make so that the simulated data set recovers the intercept. Any suggestions are greatly appreciated, thank you.

Upvotes: 2

Views: 529

Answers (2)

bzki

Reputation: 681

You should expect to recover the true coefficient values on average, but on any single simulated dataset the estimates will deviate from them. The intercept is especially imprecise here because x1 is centered around 50, far from zero, so its standard error is much larger than the slopes' (about 5 in your output). Below I repeat your simulation setup 1000 times and take the average of the coefficient estimates.

beta <- c(2, 3, 4)
do_experiment <- function(n = 100, eps.sd = 5) {
  x1  <- rnorm(n, 50, 5)
  x2  <- sample(c(0, 1), n, replace = TRUE)
  eps <- rnorm(n, 0, eps.sd)

  y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

  # return the fitted coefficients for this simulated dataset
  coef(lm(y ~ x1 + x2))
}

set.seed(1212)
coefEstimates <- replicate(1000, do_experiment(n = 100))
rowMeans(coefEstimates)

(Intercept)          x1          x2 
   1.972531    3.000327    4.010408

apply(coefEstimates, 1, sd)

(Intercept)          x1          x2 
 5.05588111  0.09988136  1.00523822 

You can decrease the variance of the error term if you want less variability in the intercept estimate from simulation to simulation. As @rookie mentions, you can also increase the sample size (a quick check with a larger n follows the output below).

set.seed(1213)
coefEstimates2 <- replicate(1000, do_experiment(n = 100, eps.sd = 1))
rowMeans(coefEstimates2)

(Intercept)          x1          x2 
   2.046227    2.999186    3.996409 

apply(coefEstimates2, 1, sd)

(Intercept)          x1          x2 
  1.0459009   0.0205488   0.1995421 
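
As a quick sketch of the sample-size effect (n and seed chosen arbitrarily), the same do_experiment harness can be reused: the intercept's standard error shrinks roughly like 1/sqrt(n), so going from n = 100 to n = 10000 should cut its simulation-to-simulation SD by about a factor of 10.

set.seed(1214)
coefEstimates3 <- replicate(1000, do_experiment(n = 10000, eps.sd = 5))
rowMeans(coefEstimates3)      # should again be close to c(2, 3, 4)
apply(coefEstimates3, 1, sd)  # intercept SD roughly 10x smaller than with n = 100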

Upvotes: 3

rookie

Reputation: 681

It's just a matter of the number of data points: the more data you have, the closer you will get to the true intercept.

set.seed(1111)
beta <- c(2, 3, 4)
x1   <- rnorm(1e6, 50, 5)
x2   <- sample(c(0, 1), 1e6, replace = TRUE)
eps  <- rnorm(1e6, 0, 5)

y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

summary(lm(y ~ x1 + x2))

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.6565  -3.3651  -0.0003   3.3694  25.8225 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.9914611  0.0504585   39.47   <2e-16 ***
x1          3.0003014  0.0009994 3002.14   <2e-16 ***
x2          3.9902120  0.0099931  399.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.997 on 999997 degrees of freedom
Multiple R-squared:  0.9017,    Adjusted R-squared:  0.9017 
F-statistic: 4.587e+06 on 2 and 999997 DF,  p-value: < 2.2e-16

Upvotes: 1
