Lion Behrens

Reputation: 319

Simulating multiple regression data with fixed R2: How to incorporate correlated variables?

I'd like to simulate data for a multiple linear regression with four predictors where I am free to specify

1. the true (standardized) regression coefficients,
2. the explained variance (R2) of the model, and
3. the degree of correlation among the predictor variables.

I arrived at a solution that fulfills the first two points, but it relies on the assumption that the independent variables are uncorrelated (see code below). To obtain standardized regression coefficients, I sample the predictors from a population with mean = 0 and variance = 1.

# Specify the population variance/covariance matrix of the four predictor variables to sample from
sigma.1 <- matrix(c(1,0,0,0,   
                    0,1,0,0,   
                    0,0,1,0,    
                    0,0,0,1),nrow=4,ncol=4)
# Specify the population means of the four predictor variables to sample from
mu.1 <- rep(0,4) 

# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30

# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Add error variable based on desired r2
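# Derivation: R2 = Var(X %*% beta) / (Var(X %*% beta) + Var(epsilon)), so
# Var(epsilon) = Var(X %*% beta) * (1 - R2) / R2. With uncorrelated, unit-variance
# predictors, Var(X %*% beta) reduces to sum(beta^2), which gives the line below.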
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))

# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon

# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))

Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0564 -0.6310 -0.0048  0.6339  3.7119 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared:  0.299, Adjusted R-squared:  0.299 
F-statistic:  5332 on 4 and 49995 DF,  p-value: < 2.2e-16

Problem: If my predictor variables are correlated, i.e. if their variance/covariance matrix is specified with non-zero off-diagonal elements, the resulting R2 and regression coefficients deviate considerably from the values I want, e.g. when using

sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                    0.25,1,0.25,0.25,   
                    0.25,0.25,1,0.25,    
                    0.25,0.25,0.25,1),nrow=4,ncol=4)
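
For intuition, here is a quick check with the population quantities (a sketch, reusing beta, r2, and this sigma.1 from above) showing how far off the target the uncorrelated-error formula lands:

var.xb <- as.numeric(t(beta) %*% sigma.1 %*% beta)  # variance of the linear predictor under this sigma.1
var.eps.naive <- sum(beta^2) * (1 - r2) / r2        # error variance from the uncorrelated formula
var.xb / (var.xb + var.eps.naive)                   # implied R2, clearly above the target of 0.30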

Any suggestions? Thanks!

Upvotes: 4

Views: 1868

Answers (2)

groceryheist

Reputation: 1747

Here is my solution. It is more general than the previous one:

# Scale each predictor column by its coefficient, so that column j of M holds beta[j] * V_j
M <- sweep(as.matrix(sample1[1:length(beta)]), 2, beta, "*")
# Var(X %*% beta) equals the sum of all entries of cov(M); scale it to hit the target r2
var.epsilon <- sum(cov(M)) * (1 - r2) / r2

sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))

# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + rowSums(M) + sample1$epsilon

# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))

Lion Behrens's answer starts in the right direction, but it is mathematically wrong, and it is also less general than it could be.

var.epsilon, the variance of the error term, should equal the variance of the linear predictor times (1 - r2)/r2: Var(epsilon) = Var(X*beta) * (1 - r2)/r2, where Var(X*beta) = sum_i beta_i^2 * Var(X_i) + 2 * sum_{i<j} beta_i * beta_j * Cov(X_i, X_j). The existing answers do not get this right: they assume the variances of the predictors equal 1, sum the squared coefficients, and then add the raw correlation rather than the coefficient-weighted covariances.

To help show how my solution relates to the previous one, I'll share a more verbose implementation:

# This matrix has to be symmetric and positive definite, but it does not need
# constant off-diagonal values or 1s on the diagonal.
sigma.1 <- matrix(c(1,    0.12, 0.23, 0.70,   
                    0.12, 1,    0.02, 0.15,   
                    0.23, 0.02, 1,    0.30,    
                    0.70, 0.15, 0.30, 1), nrow=4, ncol=4)
# Specify the population means of the four predictor variables to sample from
mu.1 <- rep(0,4) 

# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15
    
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Add error variable based on desired r2
var.epsilon <- (var(beta[1]*sample1$V1) + var(beta[2]*sample1$V2) +
                var(beta[3]*sample1$V3) + var(beta[4]*sample1$V4) +
                2*cov(beta[1]*sample1$V1, beta[2]*sample1$V2) +
                2*cov(beta[1]*sample1$V1, beta[3]*sample1$V3) +
                2*cov(beta[1]*sample1$V1, beta[4]*sample1$V4) +
                2*cov(beta[2]*sample1$V2, beta[3]*sample1$V3) +
                2*cov(beta[2]*sample1$V2, beta[4]*sample1$V4) +
                2*cov(beta[3]*sample1$V3, beta[4]*sample1$V4)) * ((1 - r2)/r2)

sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))

# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
  beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon

# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))

This way of calculating var.epsilon is super verbose and inefficient. We can improve that with some linear algebra.
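
For example, the same quantity can be written as a quadratic form. A minimal sketch using only base R, with sample1, beta, and r2 as defined above:

X <- as.matrix(sample1[1:length(beta)])
# Var(X %*% beta) = t(beta) %*% Cov(X) %*% beta
var.epsilon <- as.numeric(t(beta) %*% cov(X) %*% beta) * (1 - r2) / r2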

Upvotes: 1

Lion Behrens

Reputation: 319

After thinking about my problem a bit more, I found an answer.

The code in my question first samples the predictor variables with a given degree of correlation among each other. Then a column for the error term is added based on the desired value of r2, and finally a column for y is built from all of these pieces.

So far, the lines that create the error term are just

var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))

So it assumes that every beta coefficient contributes fully to the explanation of y (i.e. no interrelation among the independent variables). But if the x-variables are related, the betas do not(!) each contribute fully, because the variables share some of their variability. That means the variance of the error term has to be larger to still end up at the desired R2.

How much bigger? Just adapt the creation of the error term as follows:

var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)

So the degree to which the independent variables are correlated is simply added to the error variance via cor(sample1$V1, sample1$V2). If the interrelation is 0.25, e.g. by using

sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                0.25,1,0.25,0.25,   
                0.25,0.25,1,0.25,    
                0.25,0.25,0.25,1),nrow=4,ncol=4)

cor(sample1$V1, sample1$V2) will be approximately 0.25, and this value is added to the variance of the error term.
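
A quick sanity check on the simulated sample (a sketch, assuming sample1 has been re-drawn with this sigma.1):

cor(sample1$V1, sample1$V2)  # close to 0.25 for a sample of this size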

Assuming that all interrelations are equal, as in this example, any degree of correlation among the independent variables can be specified, together with the true standardized regression coefficients and a desired R2.

Proof:

sigma.1 <- matrix(c(1,0.35,0.35,0.35,   
                    0.35,1,0.35,0.35,   
                    0.35,0.35,1,0.35,    
                    0.35,0.35,0.35,1),nrow=4,ncol=4)
# Specify the population means of the four predictor variables to sample from
mu.1 <- rep(0,4) 

# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15

# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))

# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
  beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon

# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))

Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7250  -1.3696   0.0017   1.3650   9.0460 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.499554   0.002869  174.14   <2e-16 ***
V1          0.406360   0.003236  125.56   <2e-16 ***
V2          0.298892   0.003233   92.45   <2e-16 ***
V3          0.247581   0.003240   76.42   <2e-16 ***
V4          0.253510   0.003241   78.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.028 on 499995 degrees of freedom
Multiple R-squared:  0.1558,    Adjusted R-squared:  0.1557 
F-statistic: 2.306e+04 on 4 and 499995 DF,  p-value: < 2.2e-16

Upvotes: 6
