Kay0818
Kay0818

Reputation: 79

Calculate correlation for two data frames for all columns after group_by in R

Sample data:

A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))

Expected output: Expected output (x is the correlation for the column between two data frames in the region)

I have tried: Binding two data frames into one and calculate correlation between two columns in one data frame. It is a bit tedious to type every column names, which also creates too many columns. Is there a simpler way to do this?

Upvotes: 1

Views: 208

Answers (2)

jpdugo17
jpdugo17

Reputation: 7116

This is a little convoluted but it's an alternative way to do it.

library(tidyverse)

A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))

(df <- map(list(A, B), ~nest_by(.x, region)) %>% 
       reduce(inner_join, by = 'region')) 
#> # A tibble: 3 × 3
#> # Rowwise:  region
#>   region             data.x             data.y
#>   <chr>  <list<tibble[,3]>> <list<tibble[,3]>>
#> 1 AUS               [2 × 3]            [2 × 3]
#> 2 UK                [2 × 3]            [2 × 3]
#> 3 US                [2 × 3]            [2 × 3]

bind_cols(select(df, region), map2_dfr(df$data.x, df$data.y, ~map2_dfc(.x, .y, ~cor(.x, .y))))
#> # A tibble: 3 × 4
#> # Rowwise:  region
#>   region     a     b     c
#>   <chr>  <dbl> <dbl> <dbl>
#> 1 AUS        1    -1     1
#> 2 UK         1     1    -1
#> 3 US        -1    -1     1

Created on 2022-01-06 by the reprex package (v2.0.1)

Upvotes: 0

ekoam
ekoam

Reputation: 8844

If my understanding is not off, then here is a solution using dplyr and tidyr.

library(dplyr)
library(tidyr)

rbind(cbind(set = "A", A), cbind(set = "B", B)) %>% 
  pivot_longer(-c(set, region)) %>% 
  group_by(region, name) %>% 
  summarise(value = cor(value[set == "A"], value[set == "B"]), .groups = "drop") %>% 
  pivot_wider()

Output

# A tibble: 3 x 4
  region     a     b     c
  <chr>  <dbl> <dbl> <dbl>
1 AUS        1    -1     1
2 UK         1     1    -1
3 US        -1    -1     1

Upvotes: 2

Related Questions