deschen
deschen

Reputation: 10996

tidyverse - Correlations among multiple columns grouped by other column

I want to get a correlation matrix across several variables in tidyverse. However, I want to do this grouped by another column. E.g. suppose I have a data frame df with columns year and I want to see correlations across V1, V2, V3by year.

year    V1    V2    V3    misc_var
2018    5     6     5     a
2018    4     6     4     b
2018    3     2     3     NA
2013    5     8     2     4
2013    6     3     8     8
2013    4     7     5     NA

I tried sth. along the lines of

cor_output = df %>%
  group_by(year) %>%
  select(V1, V2, V3, year) %>%
  cor(use = "pairwise.complete.obs")

However, instead of calculating the correlations from V1 to V3 for each year, it just adds the year variable to the correlations.

The desired output should look like (please note the correlations in the output are made up)

year    var    V1    V2    V3
2013    V1     1    0.7    0.3
2013    V2     ...    1    ...
...
...
2018    V2    0.6    1    0.7
...

Any thoughts?

Upvotes: 5

Views: 2315

Answers (2)

gradcylinder
gradcylinder

Reputation: 410

In general terms:

dataframe %>%
  select(grouping_variable, columns) %>%
  group_by(grouping_variable) %>%
  group_modify(~ corrr::correlate(.x))

where columns may be c(col_1, col_2, ...) or col_1:col_10

Upvotes: 1

JasonAizkalns
JasonAizkalns

Reputation: 20463

One approach is to use the corrr package in conjunction with a purrr::nest():

library(tidyverse)
library(corrr)

df <- tribble(
    ~year, ~V1, ~V2, ~V3, ~misc_var,
     2018,   5,   6,   5,       "a",
     2018,   4,   6,   4,       "b",
     2018,   3,   2,   3,        NA,
     2013,   5,   8,   2,       "4",
     2013,   6,   3,   8,       "8",
     2013,   4,   7,   5,        NA
    )

df %>%
  select_if(is.numeric) %>%
  group_by(year) %>%
  nest() %>%
  mutate(
    correlations = map(data, correlate)
  ) %>%
  unnest(correlations)
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> 
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 6 x 5
#>    year rowname     V1     V2     V3
#>   <dbl> <chr>    <dbl>  <dbl>  <dbl>
#> 1  2018 V1      NA      0.866  1    
#> 2  2018 V2       0.866 NA      0.866
#> 3  2018 V3       1      0.866 NA    
#> 4  2013 V1      NA     -0.756  0.5  
#> 5  2013 V2      -0.756 NA     -0.945
#> 6  2013 V3       0.5   -0.945 NA

Alternatively, you could use the more experimental group_map or group_modify functions from dplyr:

df %>%
  select_if(is.numeric) %>%
  group_by(year) %>%
  group_map(~ correlate(.x))      # or group_modify(~ correlate(.x))

Upvotes: 4

Related Questions