Reputation: 1610
I'm trying to search for highly correlated variables. My current code is:
which(cor(numericData)<1&cor(numericData)>0.8,arr.ind = TRUE)
and this gives the output:
         row col
VarAlice  20   5
VarBob    11  10
VarCoco   10  11
Year       5  20
I have a number of problems with this:
cor(numericData) has to be written more than once. I would've liked to type something like 0.8<cor(numericData)<1.
The output given does not tell me the names of the variables that are correlated, meaning that I'll have to cross-reference this output with the massive original dataset.
Feeding this output back to cor(numericData), i.e.
cor(numericData)[which(cor(numericData)<1&cor(numericData)>0.8,arr.ind= TRUE)]
is rather ugly and loses all information about which rows/columns the data came from; it just spits out the correlation coefficients.
Is there a better way? My ideal output would be a subset of cor(numericData) that shows only the relevant correlation coefficients and has the row/column names needed to identify them. In this specific case it's clear that VarAlice appears to correlate strongly with Year, but that would have been much harder to see if I had 50 more variables, as my use case does.
Reputation: 887048
A better option is to create a temporary object with the cor output
tmp <- cor(numericData)
Use that object to get the row/column indices and look up the matching row/column names
rc <- which(tmp < 1 & tmp > 0.8, arr.ind = TRUE)
out <- data.frame(rn = row.names(tmp)[rc[,1]], cn = colnames(tmp)[rc[,2]])
and remove 'tmp' once it is no longer needed
rm(tmp)
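Since the question also asks for the coefficients next to the names, the same index matrix can be used to pull them out of the correlation matrix. The following is only a sketch: it uses the built-in mtcars data in place of numericData, and the column name 'cor' is an arbitrary choice.

```r
# Sketch only: mtcars stands in for numericData
tmp <- cor(mtcars)
rc <- which(tmp < 1 & tmp > 0.8, arr.ind = TRUE)

out <- data.frame(
  rn  = rownames(tmp)[rc[, 1]],  # name of the row variable
  cn  = colnames(tmp)[rc[, 2]],  # name of the column variable
  cor = tmp[rc]                  # a two-column index matrix picks one cell per row of rc
)
print(out)
```

Indexing the matrix with 'rc' directly returns one coefficient per (row, col) pair, so the names and the values stay aligned.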
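Another option, without creating any temporary object, is to convert the correlation matrix to a long-format data.frame with as.data.frame.table (one row per pair of variables, with the coefficient in the 'Freq' column) and subset the data.frame based on the values in the 'Freq' column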
subset(as.data.frame.table(cor(numericData)), Freq < 1 & Freq > 0.8)
A reproducible example with mtcars
subset(as.data.frame.table(cor(mtcars)), Freq < 1 & Freq > 0.8)
#    Var1 Var2      Freq
#14  disp  cyl 0.9020329
#15    hp  cyl 0.8324475
#24   cyl disp 0.9020329
#28    wt disp 0.8879799
#35   cyl   hp 0.8324475
#58  disp   wt 0.8879799
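Or with between from data.table, where incbounds = FALSE excludes both bounds (this needs the data.table package installed)
library(dplyr)
as.data.frame.table(cor(mtcars)) %>%
  filter(data.table::between(Freq, 0.8, 1, incbounds = FALSE))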
#   Var1 Var2      Freq
#1  disp  cyl 0.9020329
#2    hp  cyl 0.8324475
#3   cyl disp 0.9020329
#4    wt disp 0.8879799
#5   cyl   hp 0.8324475
#6  disp   wt 0.8879799
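Because a correlation matrix is symmetric, every pair in the outputs above is listed twice. One possible way to keep each pair once is to blank out the upper triangle and the diagonal before reshaping. This is a sketch rather than part of the original answer, and the name 'strong_pairs' is made up for illustration.

```r
# Sketch only: report each strongly correlated pair once
tmp <- cor(mtcars)

# Blank out the diagonal and the upper triangle so only one copy of each pair remains
tmp[upper.tri(tmp, diag = TRUE)] <- NA

# subset() treats NA conditions as FALSE, so the blanked cells drop out;
# use abs(Freq) > 0.8 instead to also catch strong negative correlations
strong_pairs <- subset(as.data.frame.table(tmp), Freq > 0.8)
print(strong_pairs)
```

With the diagonal removed, the 'Freq < 1' condition is no longer needed to exclude each variable's correlation with itself.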