J. Mini

Reputation: 1610

Is there a cleaner way to subset correlation matrices?

I'm trying to search for highly correlated variables. My current code is: which(cor(numericData)<1&cor(numericData)>0.8,arr.ind = TRUE), which gives this output:

             row col
VarAlice     20   5
VarBob       11  10
VarCoco      10  11
Year          5  20

I have a number of problems with this:

  1. To get this output, I've had to type out cor(numericData) more than once. I would've liked to type something like 0.8<cor(numericData)<1.
  2. The output given does not tell me the names of the variables that are correlated, meaning that I'll have to cross-reference this output with the massive original dataset.
  3. Feeding this output back to cor(numericData), i.e. cor(numericData)[which(cor(numericData)<1&cor(numericData)>0.8,arr.ind= TRUE)], is rather ugly and loses all information about what rows/columns the data came from, and just spits out the correlation coefficients.

Is there a better way? My ideal output would be a subset of cor(numericData) that shows only the relevant correlation coefficients and has the row/column names needed to identify them. In this specific case it's clear that VarAlice correlates strongly with Year, but that would have been much harder to see if I had 50 more variables, as my use case does.

Upvotes: 1

Views: 1516

Answers (1)

akrun

Reputation: 887048

A better option is to store the cor output in a temporary object:

tmp <- cor(numericData)

Then use that object to get the row/column indices and look up the corresponding row and column names:

rc <- which(tmp < 1 & tmp > 0.8, arr.ind = TRUE)
out <- data.frame(rn = row.names(tmp)[rc[, 1]], cn = colnames(tmp)[rc[, 2]])

Finally, remove 'tmp' once it is no longer needed:

rm(tmp)
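
The 'out' data.frame above holds only the names. Since the question also asks for the coefficients, here is a minimal sketch of one way to carry them along, using the built-in mtcars data as a stand-in for numericData. The names cm, idx and the column name r are hypothetical, chosen only for this illustration.

```r
# Sketch only: mtcars stands in for numericData; cm, idx and r are illustrative names.
cm  <- cor(mtcars)
idx <- which(cm < 1 & cm > 0.8, arr.ind = TRUE)

out <- data.frame(
  rn = rownames(cm)[idx[, "row"]],  # row variable of each selected cell
  cn = colnames(cm)[idx[, "col"]],  # column variable of each selected cell
  r  = cm[idx]                      # a two-column index matrix picks one cell per row of idx
)
out
```

Indexing a matrix with a two-column matrix of (row, col) positions returns the matching cells in the same order, so the r column lines up with rn and cn.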

Another option, which avoids the temporary object, is to convert the correlation matrix to a data.frame with as.data.frame.table and subset it on the 'Freq' column, which holds the correlation values:

subset(as.data.frame.table(cor(numericData)), Freq < 1 & Freq > 0.8)

A reproducible example with mtcars:

subset(as.data.frame.table(cor(mtcars)), Freq < 1 & Freq > 0.8)
#   Var1 Var2      Freq
#14 disp  cyl 0.9020329
#15   hp  cyl 0.8324475
#24  cyl disp 0.9020329
#28   wt disp 0.8879799
#35  cyl   hp 0.8324475
#58 disp   wt 0.8879799

Or with data.table's between, where incbounds = FALSE excludes both endpoints:

library(dplyr)
as.data.frame.table(cor(mtcars)) %>% 
     filter(data.table::between(Freq, 0.8, 1, incbounds = FALSE))
#  Var1 Var2      Freq
#1 disp  cyl 0.9020329
#2   hp  cyl 0.8324475
#3  cyl disp 0.9020329
#4   wt disp 0.8879799
#5  cyl   hp 0.8324475
#6 disp   wt 0.8879799
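
Each pair appears twice in the outputs above because the correlation matrix is symmetric. A possible variation, sketched here under assumptions that go beyond the original question, blanks out the diagonal and lower triangle before converting, and uses abs() so that strong negative correlations are caught as well. Drop abs() to keep the question's original positive-only criterion. The names cm, strong and the column name r are hypothetical.

```r
# Sketch only: keeps each pair once and also flags strong negative correlations.
cm <- cor(mtcars)
cm[!upper.tri(cm)] <- NA            # drop the diagonal and the mirrored lower triangle

# responseName renames the default 'Freq' column; subset() treats NA conditions as FALSE
strong <- subset(as.data.frame.table(cm, responseName = "r"), abs(r) > 0.8)
strong[order(-abs(strong$r)), ]     # strongest relationships first
```

Because the diagonal is removed up front, the upper bound of 1 is no longer needed to exclude each variable's correlation with itself.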

Upvotes: 1
