Jason

Reputation: 311

Return values from a Correlation Matrix in R

I have a 390 x 390 correlation matrix (called correl), and I would like to scan it for values between 0.80 and 0.99. I have written the following loop:

cc1 <- NA #creates a NA vector to store values between 0.80 & 0.99
cc2 <- NA #creates a NA vector to store desired values
p <- dim(correl)[2] #dim returns the size of the correlation matrix
i =1

while (i <= p) { 
    cc1 <- correl[,correl[,i] >=0.80 & correl[,i] < 1.00]
    cc2<- cbind(cc2,cc1)
    i <- i +1
}

The problem I am having is that I also get undesired correlations (those below 0.80) in cc2.

#Sample of what I mean:

                    SPY.Adjusted  AAPL.Adjusted  CHL.Adjusted  CVX.Adjusted
1  SPY.Adjusted        1.0000000     0.83491778     0.6382930     0.8568000
2  AAPL.Adjusted       0.8349178     1.00000000     0.1945304     0.1194307
3  CHL.Adjusted        0.6382930     0.19453044     1.0000000     0.2991739
4  CVX.Adjusted        0.8568000     0.11943067     0.2991739     1.0000000
5  GE.Adjusted         0.6789054     0.13729877     0.3356743     0.5219169
6  GOOGL.Adjusted      0.5567947     0.10986655     0.2552149     0.2128337

I only want to return the correlations within the desired range (0.80 to 0.99) without losing the row names or column names, as otherwise I would not know which are which.

Upvotes: 2

Views: 1422

Answers (3)

Simon Jackson

Reputation: 3174

Glad you found an answer, but here's another that puts the results in a tidy data frame just in case others are looking for this.

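This solution uses the corrr package together with dplyr functions (dplyr is loaded explicitly here in case your version of corrr does not attach it):

library(corrr)
library(dplyr)  # provides filter() and between()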

mtcars %>% 
  correlate() %>% 
  shave() %>% 
  stretch(na.rm = TRUE) %>% 
  filter(between(r, .8, .99))

#> # A tibble: 3 × 3
#>       x     y         r
#>   <chr> <chr>     <dbl>
#> 1   cyl  disp 0.9020329
#> 2   cyl    hp 0.8324475
#> 3  disp    wt 0.8879799

Explanation:

  • mtcars is the data.
  • correlate() creates a correlation data frame.
  • shave() is optional and removes the upper triangle (to remove duplicates).
  • stretch() converts the data frame (in matrix format) to a long format.
  • filter(between(r, .8, .99)) selects only the correlations between .8 and .99 (inclusive).

Upvotes: 1

Daniel Fischer

Reputation: 3380

If I understand your problem correctly, you shouldn't expect a symmetric matrix as the return object. For each of your variables, you want to extract the other variables that are highly correlated with it, but that number differs from variable to variable, so the result cannot be a rectangular matrix. If you insist on a matrix/data frame, I would rather replace the small correlations with NA

correl[correl<0.8] <- NA

and then access the column names of the variables that are highly correlated with a given variable (e.g. the one in the first row) like this

colnames(correl)[!is.na(correl[1,])]

(Although then the NA step is kind of useless, as you could get the column names directly with the constraint colnames(correl)[correl[1,] > 0.8].)
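
To apply the same idea to every variable at once, a list is a natural return structure, since each variable can have a different number of highly correlated partners. The following is only a minimal sketch: it assumes a small correlation matrix built from the built-in mtcars data as a stand-in for correl, and the name high is made up for illustration.

```r
# Assumption: a toy correlation matrix standing in for the real `correl`
correl <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])

# For each variable, collect the names of the *other* variables
# whose correlation with it lies in [0.80, 1)
high <- lapply(rownames(correl), function(v) {
  r <- correl[v, ]
  names(r)[r >= 0.80 & r < 1 & names(r) != v]
})
names(high) <- rownames(correl)  # `high` is a hypothetical name

high
```

Each element of high is a character vector, so high$cyl, for example, gives the variables highly correlated with cyl, and variables without any such partner get an empty vector.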

Upvotes: 0

csgillespie

Reputation: 60452

Let's create a simple reproducible example

m = matrix(runif(100), ncol=10)
rownames(m) = LETTERS[1:10]
colnames(m) = rownames(m)

The tricky part is getting a nice return structure that contains the variable names. So I would collapse the matrix into a standard data frame

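dd = data.frame(cor = as.vector(m),                     # values in column-major order
                id1 = rownames(m),                      # row name, recycled for every column
                id2 = rep(rownames(m), each=nrow(m)))   # column name (same labels as the rows here)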

Remove duplicate entries

dd = dd[as.vector(upper.tri(m, TRUE)),]

Then select as usual

dd[dd$cor > 0.8 & dd$cor < 0.99,]
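
The example above uses random numbers rather than real correlations. As a rough sketch of the same idea on an actual correlation matrix, one could also look up the matching cells directly with which(arr.ind = TRUE). Here cor(mtcars) is assumed as a stand-in for the asker's correl, and cm, keep, idx and res are names made up for illustration.

```r
# Assumption: cor(mtcars) stands in for the real 390 x 390 matrix
cm <- cor(mtcars)

# TRUE only for cells above the diagonal whose value lies in [0.80, 1);
# upper.tri() drops the diagonal and the mirrored duplicates
keep <- upper.tri(cm) & cm >= 0.80 & cm < 1

# Row/column positions of the matching cells
idx <- which(keep, arr.ind = TRUE)

# Long-format result that keeps both variable names
res <- data.frame(id1 = rownames(cm)[idx[, "row"]],
                  id2 = colnames(cm)[idx[, "col"]],
                  cor = cm[idx])
res
```

This avoids building the full long data frame first, which may matter little at 390 x 390 but keeps the result limited to the pairs of interest.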

Upvotes: 3
