Jonas Lindeløv

Reputation: 5683

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are exactly equal to pre-determined values and the residuals are normally distributed? For example, could I generate data so that lm(y ~ 1 + x) yields (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous predictors (e.g., lm(y ~ 1 + x1 + x2)), with bonus points if it also works for interactions (lm(y ~ 1 + x1 + x2 + x1*x2)). It should also work for small N (e.g., N < 200).

I know how to simulate random data generated from these parameters (see e.g. here), but that randomness carries over into variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.

Related: it is possible to generate data that yields a pre-determined correlation coefficient (see here and here). So I'm asking whether the same can be done for multiple regression.

Upvotes: 4

Views: 1504

Answers (2)

asachet

Reputation: 6921

One approach is to use perfectly symmetrical noise: each noise value enters once with each sign, at the same predictor values. The noise is then exactly orthogonal to the design, so the estimated parameters equal the input parameters exactly, yet the residuals still appear normally distributed.

# Noise-free response from the target model y = 1.5 + 4 * x
x <- 1:100
y <- cbind(1, x) %*% c(1.5, 4)
eps <- rnorm(100)

# Duplicate the predictors and add the noise once with each sign,
# so it cancels exactly in the least-squares fit
x <- c(x, x)
y <- c(y + eps, y - eps)

fit <- lm(y ~ x)
# (Intercept)            x  
#         1.5          4.0 

plot(fit)

Residuals are normally distributed...

[residual diagnostic plot]

... but exhibit an abnormally perfect symmetry!

[residual diagnostic plot]
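
The symmetry can be verified numerically (an illustrative check, not part of the original answer): because the fitted values equal the noiseless y exactly, the residuals are just (eps, -eps), so every residual has an exact negative twin.

r <- unname(residuals(fit))
all.equal(sort(r), sort(-r))
# TRUE: the residuals come in exact +/- pairs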

EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. The example below also shows that it works for categorical predictors and interaction effects.

library(dplyr)

# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,  # Continuous
  x2 = rep(c(0, 1), each=50),  # Dummy-coded categorical

  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)  # Residuals
)

# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter the model's complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)

# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2 
#     1.50000     4.00000    -2.10000     8.76543 
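
As a quick sanity check (illustrative, not part of the original answer): because both halves of the data carry draws from the same normal distribution, a standard normality test on the mirrored residuals typically does not reject.

shapiro.test(residuals(fit))
# p-value is typically well above 0.05 for this construction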

Upvotes: 4

Roland

Reputation: 132989

You could do rejection sampling:

set.seed(42)
tol <- 1e-8

x <- 1:100
continue <- TRUE
while (continue) {
  # Draw fresh noise and accept the sample only once the fitted
  # coefficients land within tolerance of the targets
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}

coef(lm(y ~ x))
#(Intercept)           x 
#   1.500013    4.000023

Obviously, this is a brute-force approach: the smaller the tolerance and the more complex the model, the longer it will take. A more efficient approach should be possible by taking the residuals as input and using some matrix algebra to compute the y values. But that's more of a maths question ...
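
One way to flesh out that matrix-algebra idea (a sketch, not part of the original answer): residualize the noise against the design before adding it. The projected noise is exactly orthogonal to the predictors, so it cannot shift the coefficient estimates.

set.seed(42)
x <- 1:100
beta <- c(1.5, 4)
e <- rnorm(length(x))

# residuals(lm(e ~ x)) removes the part of e that the design could explain,
# leaving noise that is orthogonal to both the intercept and x
e_perp <- residuals(lm(e ~ x))

y <- drop(cbind(1, x) %*% beta) + e_perp
coef(lm(y ~ x))
# (Intercept)           x 
#         1.5         4.0 
# exactly the targets, up to floating-point error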

Upvotes: 2
