user2246905

Reputation: 1039

Subset a data frame according to a criterion based on correlation

I have a data frame and would like to write a function that keeps only the variables with low correlation. That is, look at the pairwise correlation of each variable with all the others; if at least one of a variable's correlation coefficients is greater than 0.4, both that variable and the one it is highly correlated with are removed from the data frame.

For example suppose I have a data frame:

data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
cor(data, use="pairwise.complete.obs")

            x1         x2          x3         x4
x1  1.00000000 -0.3325757  0.08567911  0.2651721
x2 -0.33257569  1.0000000 -0.18761301  0.4660056
x3  0.08567911 -0.1876130  1.00000000 -0.3321003
x4  0.26517210  0.4660056 -0.33210031  1.0000000

I would then like to get back a data frame containing only x1 and x3, since x2 and x4 have a correlation of 0.466.

Upvotes: 0

Views: 1062

Answers (2)

lroha

Reputation: 34406

You could try:

set.seed(50)
data <-  data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
mycor <- cor(data, use="pairwise.complete.obs")
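# For each column, drop the diagonal (the maximum, 1) and test the remaining correlations against +/- 0.4
data[, !apply(mycor, 2, function(x) max(x[-which.max(x)]) > 0.4 | min(x) < -0.4), drop = FALSE]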
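Since the question asks for a function, the same idea can be wrapped up as in the sketch below. The name `drop_correlated`, its `threshold` argument, and the `toy` data are made up for illustration; the sketch also assumes that negative correlations should count, so it compares absolute values.

```r
# Hypothetical helper: keep only the columns whose absolute correlation
# with every other column is at or below `threshold`.
drop_correlated <- function(df, threshold = 0.4) {
  cm <- abs(cor(df, use = "pairwise.complete.obs"))
  diag(cm) <- 0  # ignore each variable's correlation with itself
  too_high <- apply(cm, 2, max, na.rm = TRUE) > threshold
  df[, !too_high, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
}

# Toy data: b is an exact multiple of a, while c is only weakly related to both
toy <- data.frame(a = 1:6, b = 2 * (1:6), c = c(1, -1, 1, -1, 1, -1))
names(drop_correlated(toy))  # "c"
```

Zeroing the diagonal avoids having to locate the self-correlation with `which.max()`, and `drop = FALSE` keeps the result a data frame when only one column survives.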

Upvotes: 2

thelatemail

Reputation: 93813

Build a logical matrix cd that flags every pairwise correlation whose absolute value is greater than 0.4. Then drop the flagged columns, ignoring the diagonal, where row == col:

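cd <- abs(cor(data, use="pairwise.complete.obs")) > 0.4
# setdiff() keeps every column when nothing is flagged; a negated empty index would select none
data[setdiff(seq_along(data), col(cd)[cd & row(cd) != col(cd)])]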
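One edge case is worth checking when columns are dropped by index: if no pair exceeds the threshold, the index vector is empty, and negating an empty integer vector selects zero columns rather than all of them. The sketch below shows this on made-up data (`toy` and `hits` are illustrative names only) and compares it with a `setdiff()`-based selection.

```r
# Toy data in which the two columns are uncorrelated (made up for illustration)
toy <- data.frame(u = c(1, -1, 1, -1, 1, -1), v = c(1, 1, -1, -1, 0, 0))
cd <- abs(cor(toy)) > 0.4
hits <- unique(col(cd)[cd & row(cd) != col(cd)])  # integer(0): nothing to drop

ncol(toy[-hits])                          # 0: every column is lost
ncol(toy[setdiff(seq_along(toy), hits)])  # 2: every column is kept
```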

Upvotes: 3
