Reputation: 1039
I have a data frame and would like to create a function that keeps only the variables with low correlation. This means looking at the pairwise correlation of each variable with all the other variables; if at least one correlation coefficient for a variable is greater than 0.4, then both that variable and the one it is highly correlated with are removed from the data frame.
For example suppose I have a data frame:
data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
cor(data, use="pairwise.complete.obs")
            x1         x2          x3         x4
x1  1.00000000 -0.3325757  0.08567911  0.2651721
x2 -0.33257569  1.0000000 -0.18761301  0.4660056
x3  0.08567911 -0.1876130  1.00000000 -0.3321003
x4  0.26517210  0.4660056 -0.33210031  1.0000000
I would then like to return a data frame keeping only x1 and x3 (given that x2 and x4 have a correlation of 0.46).
Upvotes: 0
Views: 1062
Reputation: 34406
You could try:
set.seed(50)
data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
mycor <- cor(data, use="pairwise.complete.obs")
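# Zero the diagonal so each variable's correlation with itself is ignored,
# then drop every column whose largest remaining |correlation| exceeds 0.4
diag(mycor) <- 0
data[, !apply(abs(mycor), 2, function(x) any(x > 0.4)), drop = FALSE]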
Upvotes: 2
Reputation: 93813
Calculate the correlation matrix cd, checking if there is anything >0.4. Then subset away, ignoring the diagonals, where row==col:
cd <- abs(cor(data, use="pairwise.complete.obs")) > 0.4
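bad <- unique(col(cd)[cd & row(cd) != col(cd)])
# setdiff() avoids data[-integer(0)], which would select no columns when nothing exceeds 0.4
data[setdiff(seq_along(data), bad)]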
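Since the question asks for a function, here is a minimal sketch of how the same idea could be wrapped up. The name drop_correlated, its threshold argument and the toy data are my own for illustration; they are not from the question. It assumes all columns are numeric.

```r
# Hypothetical helper: drop every column involved in at least one
# pairwise correlation whose absolute value exceeds `threshold`.
drop_correlated <- function(df, threshold = 0.4) {
  cm <- abs(cor(df, use = "pairwise.complete.obs"))
  diag(cm) <- 0  # ignore each variable's correlation with itself
  # A column is kept only if none of its correlations exceed the threshold;
  # NA correlations (e.g. from a constant column) are treated as not exceeding it.
  keep <- colSums(cm > threshold, na.rm = TRUE) == 0
  df[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even with one column left
}

# Small deterministic example with made-up values
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),   # b = 2 * a, so a and b are perfectly correlated
  c = c(1, -1, 0, -1, 1)   # uncorrelated with both a and b
)
names(drop_correlated(toy))
# [1] "c"
```

Using a logical keep vector rather than negative indices means the function still returns all columns when no pair exceeds the threshold.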
Upvotes: 3