Reputation: 477
I'm trying to calculate a correlation matrix for various subsets of a data frame. I found this snippet of code for calculating the correlation between two variables in the data frame:
library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp,wt))
But I would like to calculate a correlation matrix between several variables in the data frame. I would like this to be returned (preferably) as a list of correlation matrices. Something like:
mtcars %>% group_by(cyl) %>% cor(data.frame(hp, wt, qsec))
Can I do that with dplyr?
Upvotes: 0
Views: 1040
Reputation: 869
This is an old question, but I'm updating here in case it's helpful to folks.
You can use the functions in the purrr package to transform the tibble containing correlation matrices to a list of objects that can be further manipulated.
Specifically, to expand on the answer provided by @mathematical.coffee:
library(tidyverse)
data(mtcars)
mtcars %>%
dplyr::group_by(cyl) %>%
dplyr::do(cor = cor(cbind(.$hp, .$wt, .$qsec))) %>%
purrr::transpose() %>% # <- converts tibble to a row-wise list
purrr::set_names(nm = purrr::map(., 'cyl')) %>% # <- use `cyl` as item name
purrr::map('cor') # <- extract `cor` from each list item
The result is a list of correlation matrices:
$`4`
[,1] [,2] [,3]
[1,] 1.0000000 0.1598761 -0.1783611
[2,] 0.1598761 1.0000000 0.6380214
[3,] -0.1783611 0.6380214 1.0000000
$`6`
[,1] [,2] [,3]
[1,] 1.0000000 -0.3062284 -0.6280148
[2,] -0.3062284 1.0000000 0.8659614
[3,] -0.6280148 0.8659614 1.0000000
$`8`
[,1] [,2] [,3]
[1,] 1.00000000 0.01761795 -0.7554985
[2,] 0.01761795 1.00000000 0.5365487
[3,] -0.75549854 0.53654866 1.0000000
The key part of this is the purrr::transpose() function, which casts the tibble
to a list of columns before transposing it to a list of rows.
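To make the role of purrr::transpose() concrete, here is a minimal sketch on a tiny tibble. The tibble `tb` and its `label` column are hypothetical and exist only for illustration; they are not part of the answer above.

```r
library(tibble)
library(purrr)

# Hypothetical two-row tibble, used only for illustration
tb <- tibble(cyl = c(4, 6), label = c("four", "six"))

# A tibble is a list of columns; transpose() turns it into a list of rows
rows <- transpose(tb)

length(rows)    # one list element per row of `tb`
str(rows[[1]])  # a list with elements `cyl` and `label` for the first row
```

Each element of `rows` is a named list holding one row, which is why purrr::map(., 'cyl') and purrr::map('cor') can pull out fields by name in the pipeline above.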
Upvotes: 0
Reputation: 56955
In my opinion good old by or dlply is better here, but if you really want to use dplyr, I think you can use do:
o <- mtcars %>% group_by(cyl) %>% do(cor=cor(cbind(.$hp, .$wt, .$qsec)))
# Source: local data frame [3 x 2]
# Groups: <by row>
# cyl cor
# 1 4 <dbl[3,3]>
# 2 6 <dbl[3,3]>
# 3 8 <dbl[3,3]>
where the . refers to the data frame for the current group. Then you could do o$cor[[1]] etc. to pull out an individual matrix. I'm unsure how to just get a list output from dplyr rather than a dataframe output.
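One possible way to get from that data frame to a named list, sketched here as an assumption rather than a dplyr feature: the `cor` column is itself a list of matrices, so base R's setNames() can attach the `cyl` values as names. The name `cor_list` is made up for this example.

```r
library(dplyr)

# Same call as above: one row per cyl group, with a list-column `cor`
o <- mtcars %>%
  group_by(cyl) %>%
  do(cor = cor(cbind(.$hp, .$wt, .$qsec)))

# o$cor is a list of matrices; attach the group values as names
cor_list <- setNames(o$cor, o$cyl)

names(cor_list)   # "4" "6" "8"
cor_list[["4"]]   # 3 x 3 correlation matrix for cyl == 4
```

Because cbind() is called on bare vectors, the matrices have no row or column names; the rows and columns follow the order hp, wt, qsec.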
Using plyr:
library(plyr)
dlply(mtcars, .(cyl), function (x) cor(x[, c('hp', 'wt', 'qsec')]))
Using base R and by:
o <- by(mtcars[, c('hp', 'wt', 'qsec')], mtcars$cyl, cor, simplify = FALSE)
o is of class by, but ?by says this is basically a list.
length(o) # 3
names(o) # "4" "6" "8" (i.e. the cyl values)
o[[1]] # =cor(hp, wt, qsec) where cyl==4
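If a plain list is preferred over an object of class by, a similar result can be sketched with split() and lapply(). This is an alternative to the by() call above, not a restatement of it; the names `vars`, `groups`, and `cor_list` are made up for this example.

```r
# Columns to correlate (same three variables as above)
vars <- c("hp", "wt", "qsec")

# split() returns a named list of data frames, one per cyl value
groups <- split(mtcars[, vars], mtcars$cyl)

# Apply cor() to each piece; the result is an ordinary named list
cor_list <- lapply(groups, cor)

class(cor_list)   # "list"
names(cor_list)   # "4" "6" "8"
cor_list[["4"]]   # correlation matrix for cyl == 4, with hp/wt/qsec dimnames
```

Since cor() receives data frames here rather than an unnamed cbind() result, each matrix keeps hp, wt, and qsec as row and column names.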
Upvotes: 3