Reputation: 319
I'd like to simulate data for a multiple linear regression with four predictors where I am free to specify (1) the true standardized regression coefficients, (2) the explained variance (R2), and (3) the correlations among the predictors.
I arrived at a solution that fulfills the first two points, but it is based on the assumption that the independent variables are uncorrelated with each other (see code below). In order to get standardized regression coefficients, I sample the predictors from a population with mean = 0 and variance = 1.
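For reference, the error-variance line below follows from solving $R^2 = \operatorname{Var}(\hat y)/(\operatorname{Var}(\hat y)+\operatorname{Var}(\varepsilon))$ for $\operatorname{Var}(\varepsilon)$; with standardized, uncorrelated predictors $\operatorname{Var}(\hat y)=\sum_j \beta_j^2$, so

$$\operatorname{Var}(\varepsilon) = \Big(\sum_{j=1}^{4}\beta_j^2\Big)\frac{1-R^2}{R^2}.$$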
# Specify the population variance/covariance matrix of the four predictor variables to sample from
sigma.1 <- matrix(c(1, 0, 0, 0,
                    0, 1, 0, 0,
                    0, 0, 1, 0,
                    0, 0, 0, 1), nrow=4, ncol=4)
# Specify population means of the four predictor variables to sample from
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)
Residuals:
Min 1Q Median 3Q Max
-4.0564 -0.6310 -0.0048 0.6339 3.7119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.496063 0.004175 118.82 <2e-16 ***
V1 0.402588 0.004189 96.11 <2e-16 ***
V2 0.291636 0.004178 69.81 <2e-16 ***
V3 0.247347 0.004171 59.30 <2e-16 ***
V4 0.253810 0.004175 60.79 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared: 0.299, Adjusted R-squared: 0.299
F-statistic: 5332 on 4 and 49995 DF, p-value: < 2.2e-16
Problem: if my predictor variables are correlated, i.e. if their variance/covariance matrix is specified with non-zero off-diagonal elements, the resulting r2 and regression coefficients differ substantially from the values I want them to have, e.g. when using
sigma.1 <- matrix(c(1,    0.25, 0.25, 0.25,
                    0.25, 1,    0.25, 0.25,
                    0.25, 0.25, 1,    0.25,
                    0.25, 0.25, 0.25, 1), nrow=4, ncol=4)
Any suggestions? Thanks!
Upvotes: 4
Views: 1868
Reputation: 1747
Here is my solution. It is more general than the previous answers:
# Scale each predictor column by its regression coefficient
M <- as.matrix(sample1[1:length(beta)]) %*% diag(beta)
# The variance of the linear predictor is the sum of all variances and
# covariances of the scaled predictors; rescale it to get the desired r2
var.epsilon <- sum(cov(M))*(1 - r2)/r2
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + rowSums(M) + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Lion Beherns's answer starts in the right direction, but it is mathematically wrong, and it is also less general than it could be. As the calculation of var.epsilon above shows, the variance of the error term must be based on the sum of the variances and covariances of the coefficient-scaled predictors, rescaled by (1 - r2)/r2. None of the existing answers gets this right: instead, they assume the predictor variances equal 1, sum the squared coefficients, and then add the correlations (not the covariances).
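In symbols, with coefficient vector $\beta$ and predictor covariance matrix $\Sigma$,

$$\operatorname{Var}(\hat y)=\operatorname{Var}\Big(\sum_j \beta_j x_j\Big)=\beta^\top \Sigma \beta, \qquad \operatorname{Var}(\varepsilon)=\beta^\top \Sigma \beta \,\frac{1-R^2}{R^2},$$

which reduces to $\sum_j \beta_j^2\,(1-R^2)/R^2$ only when the predictors are standardized and uncorrelated.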
To help show how my solution relates to the previous ones, here is a more verbose implementation:
# This matrix has to be symmetric and positive definite, but the off-diagonal
# elements don't have to be constant and the diagonal doesn't have to be 1s.
sigma.1 <- matrix(c(1,    0.35, 0.23, 0.50,
                    0.35, 1,    0.02, 0.15,
                    0.23, 0.02, 1,    0.30,
                    0.50, 0.15, 0.30, 1), nrow=4, ncol=4)
# Specify population means of the four predictor variables to sample from
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (var(beta[1]*sample1$V1) + var(beta[2]*sample1$V2) +
                var(beta[3]*sample1$V3) + var(beta[4]*sample1$V4) +
                2*cov(beta[1]*sample1$V1, beta[2]*sample1$V2) +
                2*cov(beta[1]*sample1$V1, beta[3]*sample1$V3) +
                2*cov(beta[1]*sample1$V1, beta[4]*sample1$V4) +
                2*cov(beta[2]*sample1$V2, beta[3]*sample1$V3) +
                2*cov(beta[2]*sample1$V2, beta[4]*sample1$V4) +
                2*cov(beta[3]*sample1$V3, beta[4]*sample1$V4))*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
This way of calculating var.epsilon is super verbose and inefficient. We can improve on that with some linear algebra.
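For instance, the whole verbose block collapses into the quadratic form $\beta^\top \operatorname{Cov}(X)\,\beta$, which is the same quantity that sum(cov(M)) computes in the code at the top (a compact sketch, not a separate method):

# beta' Cov(X) beta is Var(beta1*V1 + ... + beta4*V4) in one step;
# rescale by (1 - r2)/r2 to get the required error variance
var.epsilon <- as.numeric(t(beta) %*% cov(sample1[1:length(beta)]) %*% beta) * (1 - r2)/r2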
Upvotes: 1
Reputation: 319
After thinking about my problem a bit more, I found an answer.
The code above first samples the predictor variables with a given degree of correlation among each other. Then a column for the error is added based on the desired value of r2. Then with all of that together a column for y is added.
So far, the lines that create the error are just
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
So it assumes that every beta coefficient contributes 100% to the explanation of y (= no interrelation among the independent variables). But if the x-variables are related, each beta does not(!) contribute 100%. That means the variance of the error has to be larger, because the variables take some variability from each other.
How much larger? Just adapt the creation of the error term as follows:
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
So the degree to which the independent variables are correlated is simply added to the error variance via cor(sample1$V1, sample1$V2). If the interrelation is 0.25, e.g. by using
sigma.1 <- matrix(c(1,    0.25, 0.25, 0.25,
                    0.25, 1,    0.25, 0.25,
                    0.25, 0.25, 1,    0.25,
                    0.25, 0.25, 0.25, 1), nrow=4, ncol=4)
cor(sample1$V1, sample1$V2) comes out at roughly 0.25, and this value is added to the variance of the error term.
Assuming that all interrelations are equal, as in this example, any degree of interrelation among the independent variables can be specified in this way, together with the true standardized regression coefficients and a desired R2.
Proof:
sigma.1 <- matrix(c(1,    0.35, 0.35, 0.35,
                    0.35, 1,    0.35, 0.35,
                    0.35, 0.35, 1,    0.35,
                    0.35, 0.35, 0.35, 1), nrow=4, ncol=4)
# Specify population means of the four predictor variables to sample from
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)
Residuals:
Min 1Q Median 3Q Max
-10.7250 -1.3696 0.0017 1.3650 9.0460
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.499554 0.002869 174.14 <2e-16 ***
V1 0.406360 0.003236 125.56 <2e-16 ***
V2 0.298892 0.003233 92.45 <2e-16 ***
V3 0.247581 0.003240 76.42 <2e-16 ***
V4 0.253510 0.003241 78.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.028 on 499995 degrees of freedom
Multiple R-squared: 0.1558, Adjusted R-squared: 0.1557
F-statistic: 2.306e+04 on 4 and 499995 DF, p-value: < 2.2e-16
Upvotes: 6