Reputation: 311
I have a correlation matrix (called correl) that is 390 x 390, and I would like to scan it for values between 0.80 and 0.99. I have written the following loop:
cc1 <- NA #creates a NA vector to store values between 0.80 & 0.99
cc2 <- NA #creates a NA vector to store desired values
p <- dim(correl)[2] #dim returns the size of the correlation matrix
i =1
while (i <= p) {
cc1 <- correl[,correl[,i] >=0.80 & correl[,i] < 1.00]
cc2<- cbind(cc2,cc1)
i <- i +1
}
The problem I am having is that I also get undesired correlations (those below 0.80) in cc2.
#Sample of what I mean:
                  SPY.Adjusted AAPL.Adjusted  CHL.Adjusted  CVX.Adjusted
1 SPY.Adjusted       1.0000000    0.83491778     0.6382930     0.8568000
2 AAPL.Adjusted      0.8349178    1.00000000     0.1945304     0.1194307
3 CHL.Adjusted       0.6382930    0.19453044     1.0000000     0.2991739
4 CVX.Adjusted       0.8568000    0.11943067     0.2991739     1.0000000
5 GE.Adjusted        0.6789054    0.13729877     0.3356743     0.5219169
6 GOOGL.Adjusted     0.5567947    0.10986655     0.2552149     0.2128337
I only want to return the correlations within the desired range (0.80 to 0.99) without losing the row.names or col.names, as otherwise I would not know which are which.
Upvotes: 2
Views: 1422
Reputation: 3174
Glad you found an answer, but here's another that puts the results in a tidy data frame just in case others are looking for this.
This solution uses the corrr package together with dplyr functions:
library(corrr)
library(dplyr)  # for filter() and between(); loaded explicitly in case corrr does not attach it
mtcars %>%
correlate() %>%
shave() %>%
stretch(na.rm = TRUE) %>%
filter(between(r, .8, .99))
#> # A tibble: 3 × 3
#> x y r
#> <chr> <chr> <dbl>
#> 1 cyl disp 0.9020329
#> 2 cyl hp 0.8324475
#> 3 disp wt 0.8879799
Explanation:
mtcars is the data.
correlate() creates a correlation data frame.
shave() is optional and removes the upper triangle (to remove duplicates).
stretch() converts the data frame (in matrix format) to a long format.
filter(between(r, .8, .99)) selects only the correlations between .8 and .99.
Upvotes: 1
Reputation: 3380
If I understood your problem correctly, you shouldn't expect a symmetric matrix as the return object. For each of your variables, you want to extract the other variables that are highly correlated with it, but the number of such variables differs from variable to variable, so you cannot store the result in a matrix.
If you insist on a matrix/data frame, I would rather replace small correlations with NA
correl[correl<0.8] <- NA
and then access the column names of the variables that are highly correlated with a given variable (e.g. the one in the first row) like this:
colnames(correl)[!is.na(correl[1,])]
(Although then the NA step is kind of useless, as you could access the column names directly with the constraint
colnames(correl)[correl[1,] > 0.8]
)
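The per-variable lookup can be applied to every row at once. Here is a minimal sketch, assuming a small made-up correlation matrix; the names correl_demo and high are hypothetical and not from the question. The result is a list because the number of matches differs per variable.

```r
# Toy symmetric correlation matrix (made-up values, for illustration only)
vars <- c("A", "B", "C")
correl_demo <- matrix(c(1.00, 0.85, 0.30,
                        0.85, 1.00, 0.90,
                        0.30, 0.90, 1.00),
                      nrow = 3, dimnames = list(vars, vars))

# For each variable, keep the names of the columns whose correlation
# lies in [0.80, 0.99]; the upper bound drops the diagonal (value 1)
high <- sapply(vars, function(v) {
  r <- correl_demo[v, ]
  names(r)[r >= 0.80 & r <= 0.99]
}, simplify = FALSE)

high
```

Each list element is named after a variable and holds the names of its highly correlated partners, so no row or column names are lost.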
Upvotes: 0
Reputation: 60452
Let's create a simple reproducible example
m = matrix(runif(100), ncol=10)
rownames(m) = LETTERS[1:10]
colnames(m) = rownames(m)
The tricky part is getting a nice return structure that contains the variable names. So I would collapse the matrix into a standard data frame
dd = data.frame(cor = as.vector(m),
id1=rownames(m),
id2=rep(rownames(m), each=nrow(m)))
Remove duplicate entries
dd = dd[as.vector(upper.tri(m, TRUE)),]
Then select as usual
dd[dd$cor > 0.8 & dd$cor < 0.99,]
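The same collapse-and-filter idea can be applied directly to a real correlation matrix. This is only a sketch: it assumes the built-in mtcars data as a stand-in for the asker's matrix, and cm, keep, idx and res are illustrative names. It uses which() with arr.ind = TRUE instead of building the full long data frame first.

```r
cm <- cor(mtcars)  # stand-in for the 390 x 390 correlation matrix

# Cells in the strict upper triangle whose value lies in [0.80, 0.99];
# using the upper triangle keeps each pair once and skips the diagonal
keep <- upper.tri(cm) & cm >= 0.80 & cm <= 0.99
idx  <- which(keep, arr.ind = TRUE)  # two-column matrix of (row, col) positions

# Long-format result that retains both variable names
res <- data.frame(id1 = rownames(cm)[idx[, "row"]],
                  id2 = colnames(cm)[idx[, "col"]],
                  cor = cm[idx])
res
```

For a 390 x 390 matrix this avoids copying all 152,100 cells into a data frame before filtering, though either approach should be fast at that size.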
Upvotes: 3