Subset a data frame according to a criterion based on correlation

Question

I have a data frame and would like to create a function that keeps only the variables with low correlation. That is, look at the pairwise correlation of each variable with all the other variables; if at least one correlation coefficient is greater than 0.4, then both that variable and the one it is highly correlated with are removed from the data frame.

For example suppose I have a data frame:

data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
cor(data, use="pairwise.complete.obs")

            x1         x2          x3         x4
x1  1.00000000 -0.3325757  0.08567911  0.2651721
x2 -0.33257569  1.0000000 -0.18761301  0.4660056
x3  0.08567911 -0.1876130  1.00000000 -0.3321003
x4  0.26517210  0.4660056 -0.33210031  1.0000000

I would then like to return a data frame keeping only x1 and x3 (given that x2 and x4 have a correlation of about 0.47).

thelatemail · Accepted Answer

Compute the absolute correlation matrix and test it against 0.4, which gives a logical matrix cd. Then drop every column that appears in a flagged pair, ignoring the diagonal, where row == col:

cd <- abs(cor(data, use="pairwise.complete.obs")) > 0.4
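# Columns involved in any off-diagonal correlation above the threshold
drop <- unique(col(cd)[which(cd & row(cd) != col(cd))])
data[setdiff(seq_along(data), drop)]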
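
Since the question asks for a function, the same idea could be wrapped up roughly as follows. This is only a sketch: drop_correlated, its threshold argument, and the toy data frame are names made up here for illustration, not part of the original answer.

```r
# Hypothetical helper: the name and arguments are illustrative only.
drop_correlated <- function(df, threshold = 0.4) {
  # Logical matrix: TRUE where the absolute pairwise correlation exceeds threshold
  high <- abs(cor(df, use = "pairwise.complete.obs")) > threshold
  # Ignore each variable's correlation with itself
  diag(high) <- FALSE
  # Column indices of every variable that takes part in a flagged pair;
  # which() skips NA entries instead of propagating them
  flagged <- unique(which(high, arr.ind = TRUE)[, "col"])
  # Keep the remaining columns; setdiff() keeps everything if nothing is flagged
  df[, setdiff(seq_len(ncol(df)), flagged), drop = FALSE]
}

# Toy data (assumed for illustration): u and v are perfectly correlated,
# w is only weakly correlated with both
toy <- data.frame(
  u = c(1, 2, 3, 4, 5, 6),
  v = c(2, 4, 6, 8, 10, 12),
  w = c(1, -1, -1, 1, 1, -1)
)
names(drop_correlated(toy))
```

Selecting the columns to keep, rather than negating the columns to drop, avoids the case where a negative index built from an empty vector returns a data frame with no columns, which is what happens when no correlation crosses the threshold.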
