Calculate correlation for two data frames for all columns after group_by in R

Question

Sample data:

A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))

Expected output: (x is the correlation for the column between two data frames in the region)

I have tried: Binding two data frames into one and calculate correlation between two columns in one data frame. It is a bit tedious to type every column names, which also creates too many columns. Is there a simpler way to do this?

ekoam · Accepted Answer

If my understanding is not off, then here is a solution using dplyr and tidyr.

library(dplyr)
library(tidyr)

rbind(cbind(set = "A", A), cbind(set = "B", B)) %>% 
  pivot_longer(-c(set, region)) %>% 
  group_by(region, name) %>% 
  summarise(value = cor(value[set == "A"], value[set == "B"]), .groups = "drop") %>% 
  pivot_wider()

Output

# A tibble: 3 x 4
  region     a     b     c
      
1 AUS        1    -1     1
2 UK         1     1    -1
3 US        -1    -1     1

Calculate correlation for two data frames for all columns after group_by in R

Answers (2)

Related Questions