user17621

Reputation: 49

R - finding the top 10 correlation values within dataset

I have calculated about 100 correlations and would like to see the top 10 values among them. How can I output these top 10 correlation values automatically, without checking them manually one by one?

The correlation values are calculated like this:

my_correlation_1 <- function(ticker_subset, data) {
  cor(subset(data, TickerSymbol == ticker_subset, c(Sales, Stockprice_quarterly)))
}

mycor1 <- lapply(unique(dat$TickerSymbol), my_correlation_1, data = dat)
names(mycor1) <- unique(dat$TickerSymbol)

The correlation calculations provide results like this:

# $AMD
#                           Sales Stockprice_quarterly
# Sales                 1.0000000           -0.2261417
# Stockprice_quarterly -0.2261417            1.0000000
# 
# $AAPL
#                          Sales Stockprice_quarterly
# Sales                1.0000000            0.6531391
# Stockprice_quarterly 0.6531391            1.0000000
# 
# $EBAY
#                          Sales Stockprice_quarterly
# Sales                1.0000000            0.2032839
# Stockprice_quarterly 0.2032839            1.0000000

Many thanks in advance!

Upvotes: 1

Views: 236

Answers (1)

r2evans

Reputation: 160447

I'll demonstrate using data we have: mtcars.

allcors <- lapply(unique(mtcars$cyl), function(z) cor(subset(mtcars, cyl == z, select = c(mpg, disp))))
allcors
# [[1]]
#        mpg  disp
# mpg  1.000 0.103
# disp 0.103 1.000
# [[2]]
#         mpg   disp
# mpg   1.000 -0.805
# disp -0.805  1.000
# [[3]]
#        mpg  disp
# mpg   1.00 -0.52
# disp -0.52  1.00

Each matrix is symmetric with 1s on the diagonal, so we only need one off-diagonal value from each. From that, we can rank the values and go from there.

sapply(allcors, function(z) z[2,1])
# [1]  0.103 -0.805 -0.520
rank(sapply(allcors, function(z) z[2,1]))
# [1] 3 1 2

This indicates that the second value is the lowest (rank 1) of the bunch. This uses the signed value; if you want the rank of the absolute value, use abs(.):

abs(sapply(allcors, function(z) z[2,1]))
# [1] 0.103 0.805 0.520
rank(abs(sapply(allcors, function(z) z[2,1])))
# [1] 1 3 2

From here, if you want to select 2 of these 3 by rank (which would be 10 of your n), store the absolute-value ranks and use which:

allranks <- rank(abs(sapply(allcors, function(z) z[2,1])))
which(allranks <= 2)
# [1] 1 3

meaning that the first and third of the original categories (your TickerSymbol) have the two lowest ranks.

And tying it back to the original categories,

unique(mtcars$cyl)[ which(allranks <= 2) ]
# [1] 6 8

(These are the least-correlated in a sense. Use rank(-abs(.)) for the most-correlated.)
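
The steps above can be collected into one small function. This is only a sketch: top_n_cors is a hypothetical helper name, not something from base R or from the question, and it assumes a named list of 2x2 correlation matrices like mycor1 in the question. It uses order() instead of rank() so that the result comes back already sorted from strongest to weakest.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# given a named list of 2x2 correlation matrices, return the n correlations
# that are largest in absolute value, keeping their signs and names.
top_n_cors <- function(cors, n = 10) {
  # pull the single off-diagonal value out of each matrix
  vals <- vapply(cors, function(m) m[2, 1], numeric(1))
  # indices sorted by descending absolute value, truncated to the first n
  vals[head(order(abs(vals), decreasing = TRUE), n)]
}

# Demonstration on mtcars, naming the list by group as the question does
cyls <- unique(mtcars$cyl)
allcors <- lapply(cyls, function(z) cor(subset(mtcars, cyl == z, select = c(mpg, disp))))
names(allcors) <- cyls
top_n_cors(allcors, n = 2)
```

With the question's data this would presumably be called as top_n_cors(mycor1, n = 10), and the names of the result would be the ticker symbols.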


Alternatives, starting from scratch with the data and not using lapply:

dplyr

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarize(corr = cor(cbind(mpg, disp))[2,1]) %>%
  slice_max(abs(corr), n=2)
# # A tibble: 2 x 2
#     cyl   corr
#   <dbl>  <dbl>
# 1     4 -0.805
# 2     8 -0.520

data.table

library(data.table)
as.data.table(mtcars)[, .(corr = cor(cbind(mpg, disp))[2,1]), by = cyl
  ][ rank(-abs(corr)) <= 2, ]
#      cyl   corr
#    <num>  <num>
# 1:     4 -0.805
# 2:     8 -0.520

base R

do.call(rbind,
  by(mtcars, mtcars["cyl"],
     FUN = function(z) data.frame(cyl = z$cyl[1], corr = cor(z$mpg, z$disp))
  )
)
#   cyl   corr
# 4   4 -0.805
# 6   6  0.103
# 8   8 -0.520

(which you can then sort/filter like the rest).
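
As a rough sketch of that last step, the base R result can be stored and then ordered by absolute correlation. The variable names res and top2 are assumptions introduced here for illustration.

```r
# Build the per-group table as in the base R alternative, stored in a variable
res <- do.call(rbind,
  by(mtcars, mtcars["cyl"],
     FUN = function(z) data.frame(cyl = z$cyl[1], corr = cor(z$mpg, z$disp))
  )
)
# Sort rows by descending absolute correlation and keep the first two
top2 <- head(res[order(abs(res$corr), decreasing = TRUE), ], 2)
top2
```

Replacing 2 with 10 gives the top 10 when there are more groups.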

Upvotes: 3
