user3030872

Reputation: 477

Aggregated Correlation (R::dplyr)

I'm trying to calculate a correlation matrix for various subsets of a data frame. I found this snippet of code for calculating the correlation between two variables within each group:

library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp,wt))

But I would like to calculate a correlation matrix between several variables in the data frame, returned (preferably) as a list of correlation matrices. Something like:

mtcars %>% group_by(cyl) %>% cor(data.frame(hp,wt,qsec))

Can I do that with dplyr?

Upvotes: 0

Views: 1040

Answers (2)

jackinovik

Reputation: 869

This is an old question, but I'm updating here in case it's helpful to folks.

You can use the functions in the purrr package to transform the tibble containing correlation matrices to a list of objects that can be further manipulated.

Specifically, to expand on the answer provided by @mathematical.coffee:

library(tidyverse)
data(mtcars)

mtcars %>% 
  dplyr::group_by(cyl) %>% 
  dplyr::do(cor = cor(cbind(.$hp, .$wt, .$qsec))) %>%
  purrr::transpose() %>%     # <- converts tibble to a row-wise list
  purrr::set_names(nm = purrr::map(., 'cyl')) %>%  # <- use `cyl` as item name
  purrr::map('cor')      # <- extract `cor` from each list item

The result is a list of correlation matrices:

$`4`
           [,1]      [,2]       [,3]
[1,]  1.0000000 0.1598761 -0.1783611
[2,]  0.1598761 1.0000000  0.6380214
[3,] -0.1783611 0.6380214  1.0000000

$`6`
           [,1]       [,2]       [,3]
[1,]  1.0000000 -0.3062284 -0.6280148
[2,] -0.3062284  1.0000000  0.8659614
[3,] -0.6280148  0.8659614  1.0000000

$`8`
            [,1]       [,2]       [,3]
[1,]  1.00000000 0.01761795 -0.7554985
[2,]  0.01761795 1.00000000  0.5365487
[3,] -0.75549854 0.53654866  1.0000000

The key part of this is the purrr::transpose() function, which casts the tibble to a list of columns before transposing it to a list of rows.

Upvotes: 0

mathematical.coffee

Reputation: 56955

In my opinion good old by or dlply is better here, but if you really want to use dplyr, I think you can use do:

o <- mtcars %>% group_by(cyl) %>% do(cor=cor(cbind(.$hp, .$wt, .$qsec)))
# Source: local data frame [3 x 2]
# Groups: <by row>

#   cyl        cor
# 1   4 <dbl[3,3]>
# 2   6 <dbl[3,3]>
# 3   8 <dbl[3,3]>

where . refers to the data frame for the current group. You can then extract a single matrix with o$cor[[1]], and so on. I'm unsure how to get a plain list from dplyr rather than a data frame.
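One possible way to stay in a pipe and still end up with a plain named list is to split the data frame by group first and then map over the pieces. This is only a sketch: it relies on base R's split() and on purrr::map() rather than on dplyr's grouping, and the names vars and cor_by_cyl are made up for illustration.

```r
library(dplyr)  # loaded here only for the %>% pipe
library(purrr)  # map()

# Columns to correlate (illustrative name, not from the question)
vars <- c("hp", "wt", "qsec")

cor_by_cyl <- mtcars %>%
  split(.$cyl) %>%             # named list of data frames, one per cyl value
  map(~ cor(.x[, vars]))       # 3 x 3 correlation matrix for each group

names(cor_by_cyl)    # the cyl values, as character strings
cor_by_cyl[["4"]]    # correlation matrix for the cyl == 4 subset
```

Because the columns are selected by name instead of being bound with cbind(.$hp, ...), each matrix keeps hp, wt and qsec as its row and column names.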


Using plyr:

library(plyr)
dlply(mtcars, .(cyl), function (x) cor(x[, c('hp', 'wt', 'qsec')]))

Using base R and by:

o <- by(mtcars[, c('hp', 'wt', 'qsec')], mtcars$cyl, cor, simplify = FALSE)

o is of class by, but ?by says this is basically a list.

length(o) # 3
names(o) # "4" "6" "8" (i.e. the cyl values)
o[[1]] # correlation matrix of hp, wt and qsec for the rows where cyl == 4

Upvotes: 3
