Jonas Lindeløv

Reputation: 5683

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are exactly equal to pre-determined values and the residuals are normally distributed? For example, could I generate data so that lm(y ~ 1 + x) yields (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous predictors (e.g., lm(y ~ 1 + x1 + x2)), with bonus points if it also works for interactions (lm(y ~ 1 + x1 + x2 + x1*x2)). It should also work for small N (e.g., N < 200).

I know how to simulate random data generated from these parameters (see e.g. here), but that randomness carries over into variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.

Related: it is possible to generate data that yields a pre-determined correlation coefficient (see here and here). So I'm asking whether the same can be done for multiple regression.

Upvotes: 4

Views: 1504

Answers (2)

asachet

Reputation: 6921

One approach is to use perfectly symmetrical noise: each noise value enters once with each sign, at the same predictor values. The noise is then exactly orthogonal to the design, so the estimated parameters equal the input parameters exactly, yet the residuals still appear normally distributed.

# Noise-free response from the target model y = 1.5 + 4 * x
x <- 1:100
y <- cbind(1, x) %*% c(1.5, 4)
eps <- rnorm(100)

# Duplicate the predictors and add the noise once with each sign,
# so it cancels exactly in the least-squares fit
x <- c(x, x)
y <- c(y + eps, y - eps)

fit <- lm(y ~ x)
# (Intercept)            x  
#         1.5          4.0 

plot(fit)

Residuals are normally distributed...

[residual diagnostic plot]

... but exhibit an abnormally perfect symmetry!

[residual diagnostic plot]
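
The symmetry can be verified numerically (an illustrative check, not part of the original answer): because the fitted values equal the noiseless y exactly, the residuals are just (eps, -eps), so every residual has an exact negative twin.

r <- unname(residuals(fit))
all.equal(sort(r), sort(-r))
# TRUE: the residuals come in exact +/- pairs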

EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. The example below also shows that it works for categorical predictors and interaction effects.

library(dplyr)

# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,  # Continuous
  x2 = rep(c(0, 1), each=50),  # Dummy-coded categorical

  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)  # Residuals
)

# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter the model's complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)

# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2 
#     1.50000     4.00000    -2.10000     8.76543 
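
As a quick sanity check (illustrative, not part of the original answer): because both halves of the data carry draws from the same normal distribution, a standard normality test on the mirrored residuals typically does not reject.

shapiro.test(residuals(fit))
# p-value is typically well above 0.05 for this construction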

Upvotes: 4

Roland

Reputation: 132989

You could do rejection sampling:

set.seed(42)
tol <- 1e-8

x <- 1:100
continue <- TRUE
while (continue) {
  # Draw fresh noise and accept the sample only once the fitted
  # coefficients land within tolerance of the targets
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}

coef(lm(y ~ x))
#(Intercept)           x 
#   1.500013    4.000023

Obviously, this is a brute-force approach: the smaller the tolerance and the more complex the model, the longer it will take. A more efficient approach should be possible by taking the residuals as input and using some matrix algebra to compute the y values. But that's more of a maths question ...
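
One way to flesh out that matrix-algebra idea (a sketch, not part of the original answer): residualize the noise against the design before adding it. The projected noise is exactly orthogonal to the predictors, so it cannot shift the coefficient estimates.

set.seed(42)
x <- 1:100
beta <- c(1.5, 4)
e <- rnorm(length(x))

# residuals(lm(e ~ x)) removes the part of e that the design could explain,
# leaving noise that is orthogonal to both the intercept and x
e_perp <- residuals(lm(e ~ x))

y <- drop(cbind(1, x) %*% beta) + e_perp
coef(lm(y ~ x))
# (Intercept)           x 
#         1.5         4.0 
# exactly the targets, up to floating-point error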

Upvotes: 2
