Reputation: 21
I would like to simulate one vector that is correlated with two other existing variables. What I tried so far
# some correlation matrix
desiredCorrelations = matrix(c(1, .4, 0,
.4, 1, .3,
0, .3, 1), nrow = 3)
# some simulated data based on the correlation matrix
dat = mvrnorm(n = 1000, mu = rep(3, 3), Sigma = desiredCorrelations, empirical = TRUE)
n = nrow(dat)
k = ncol(desiredCorrelations)
x = matrix( rnorm(n*k), nc=k )
x[,1] = dat[,1]
y = x %*% solve(chol(var(x))) %*% chol(desiredCorrelations)
# cor(y) # Desired correlation matrix
apply(dat, 2, summary)
apply(y, 2, summary)
Based on this piece of code, the correlations are correct, but only the first column of y is identical to the first column of the originally simulated data. However, I would like two columns to stay the same, while the third column is simulated with the the desired correlation matrix taken into account.
Thanks in advance for any suggestions or tips!
Upvotes: 2
Views: 1227
Reputation: 155
use the faux package: https://debruine.github.io/faux/
First create the two variables you want the third to be correlated with (rnorm_pre() will require at least 4-ish such pairs of variables, so I generate 4 sets rather than just one, below):
> sx1x2 <- rnorm_multi(n=4, vars=2,mu=0,sd=1,r=.1)
> sx1x2
X1 X2
1 0.295255640 -0.28770654
2 -0.326833060 1.10811640
3 0.008780557 0.04629117
4 -1.658893622 -1.56070681
>
Next, create the third variable so that it is correlated by 0.5 with the first and 0.3 with the second:
> y1<-rnorm_pre(sx1x2, r = c(0.5, 0.3))
> y1
[1] 1.5205031 0.4098795 -1.1482235 -1.4352354
>
Obviously with only 4 samples, the measured correlations will be off:
> cov(sx1x2[,1],y1)
[1] 0.8037426
> cov(sx1x2[,2],y1)
[1] 0.6967516
But with a larger sample, you can see the generate data is correct:
> sx1x2 <- rnorm_multi(n=1000, vars=2,mu=0,sd=1,r=.1)
> y1<-rnorm_pre(sx1x2, r = c(0.5, 0.3))
> c(cov(sx1x2[,1],y1),cov(sx1x2[,2],y1))
[1] 0.5278817 0.3277606
>
Hope that helps!
Upvotes: 1
Reputation: 868
I wrote a function for this based on MattBagg's code that takes a vector x and returns a vector with the specified mean, sd, and correlation:
simcor <- function (x, ymean=0, ysd=1, correlation=0) {
n <- length(x)
y <- rnorm(n)
z <- correlation * scale(x)[,1] + sqrt(1 - correlation^2) *
scale(resid(lm(y ~ x)))[,1]
yresult <- ymean + ysd * z
yresult
}
Upvotes: 3