Reputation: 37
I would like to simulate data for a regression analysis, which involves a dummary variable. While a regression recovers the slopes, it does not recover the intercept:
beta <- c(2,3,4)
x1 <- rnorm(100,50,5)
x2 <- sample(c(0,1), replace=T,100)
eps <- rnorm(100, 0, 5)
y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps
summary(lm(y~x1 + x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-12.6598 -2.7433 -0.2873 2.4616 13.2250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.2858 5.3470 -1.550 0.124
x1 3.2216 0.1070 30.109 < 2e-16 ***
x2 3.9209 0.9065 4.325 3.7e-05 ***
I know the dummy variable shifts the intercept up or down, but I'm confused about what adjustments to make to create a data set that can recover the intercept. Any suggestions are greatly appreciated, thank you.
Upvotes: 2
Views: 529
Reputation: 681
You should expect that on average, you will recover the true coefficient values. But on a given simulated dataset, your coefficient estimates will deviate. I repeat your simulation setup 1000 times and then I take the average of the coefficient estimates.
beta <- c(2,3,4)
do_experiment <- function(n = 100, eps.sd = 5) {
x1 <- rnorm(n, 50, 5)
x2 <- sample(c(0,1), replace=T, n)
eps <- rnorm(n, 0, eps.sd)
y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps
return(coef(lm(y~x1 + x2)))
}
set.seed(1212)
coefEstimates <- replicate(1000, do_experiment(n = 100))
rowMeans(coefEstimates)
(Intercept) x1 x2
1.972531 3.000327 4.010408
apply(coefEstimates, 1, sd)
(Intercept) x1 x2
5.05588111 0.09988136 1.00523822
You could decrease the variance of your error term if you want less variability in your intercept estimate from simulation to simulation. As @rookie mentions, you can also increase the sample size.
set.seed(1213)
coefEstimates2 <- replicate(1000, do_experiment(n = 100, eps.sd = 1))
rowMeans(coefEstimates2)
(Intercept) x1 x2
2.046227 2.999186 3.996409
apply(coefEstimates2, 1, sd)
(Intercept) x1 x2
1.0459009 0.0205488 0.1995421
Upvotes: 3
Reputation: 681
Its just an issue of the number of data points. The more data the closer you will get to the correct intercept.
set.seed(1111)
beta <- c(2,3,4)
x1 <- rnorm(1E6,50,5)
x2 <- sample(c(0,1), replace=T,1E6)
eps <- rnorm(1E6, 0, 5)
y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps
summary(lm(y~x1 + x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-25.6565 -3.3651 -0.0003 3.3694 25.8225
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9914611 0.0504585 39.47 <2e-16 ***
x1 3.0003014 0.0009994 3002.14 <2e-16 ***
x2 3.9902120 0.0099931 399.30 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.997 on 999997 degrees of freedom
Multiple R-squared: 0.9017, Adjusted R-squared: 0.9017
F-statistic: 4.587e+06 on 2 and 999997 DF, p-value: < 2.2e-16
Upvotes: 1