Reputation: 804
I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The variables will be regression model will produce coefficients like this: Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Upvotes: 10
Views: 4546
Reputation: 1584
Here is another code to generate multiple linear regression with errors follow normal distribution: OPS sorry this code just produces multiple regression
sim.regression<-function(n.obs=10,coefficients=runif(10,-5,5),s.deviation=.1){
n.var=length(coefficients)
M=matrix(0,ncol=n.var,nrow=n.obs)
beta=as.matrix(coefficients)
for (i in 1:n.var){
M[,i]=rnorm(n.obs,0,1)
}
y=M %*% beta + rnorm(n.obs,0,s.deviation)
return (list(x=M,y=y,coeff=coefficients))
}
Upvotes: -1
Reputation: 21433
This is how I would do it (blind iterative algorithm, assuming no knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
set.seed(seed)
sd.value <- 1
rsquare <- 1:nsim
results <- 1:nsim
for (i in 1:nsim) {
# tracking iteration: if we miss the value, abort at sd.value > 7.
iter <- 0
while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
sd.value <- sd.value + 0.01
rsquare[i] <- simulate.sd.iter(sd.value, n)
iter <- iter + 1
if (iter > 3000) { break }
}
results[i] <- sd.value # store the current sd.value that is OK!
sd.value <- 1
}
cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n=200) { # helper function
# Takes the sd.value, creates data, and returns the r-squared
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}
simulate.sd()
A few things to note:
sd.value
.The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55
,
which takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 to the iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100, taking 150 seconds, with steps increase of 0.001, and tolerance still at 0.01:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.513 4.913 5.036 5.018 5.157 5.393
Why are you interested in this though?
Upvotes: 2
Reputation: 15163
So the formula for R^2 is 1-var(residual)/var(total)
In this case, the variance of Y
is going to be 3^2+2^2+sd.value^2
, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2
.
So you can compute rsquared explicitly with this function:
rsq<-function(x){1-x^2/(9+ 4+x^2)}
With a little algebra, you can compute the inverse of this function:
rsqi<-function(x){sqrt(13)*sqrt((1-x)/x)}
So setting sd.value<-rsqi(rsquared)
should give you what you want.
We can test this as follows:
simrsq<-function(x){
Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}
> meanrsq<-rep(0,9)
> for(i in 1:50)
+ meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
Upvotes: 3
Reputation: 3288
Take a look at this code, it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
stopifnot(length(beta) == 3)
df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs)) # x1 and x2 are independent
var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
stopifnot(var.epsilon > 0)
df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
return(df)
}
get.R.sq <- function(desired) {
model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where the var[Xs]=1 by assumption. You can then solve for var[eps] as a function of R-squared.
Upvotes: 8