Kay0818
Kay0818

Reputation: 79

Calculate grouped pairwise correlation between two dataframe in R

I have two data frames with the same structure:

df1 <- data.frame(group1=c("A","A","A","B","B","C","C","C"), 
group2 = c(1,1,2,1,1,2,2,1), 
col1 = c(1,2,3,4,5,6,7,8), 
col2 = c(3,5,7,4,3,7,2,7))

df2 <- data.frame(group1=c("A","A","A","B","B","C","C","C"), 
group2 = c(1,1,2,1,1,2,2,1), 
col1 = c(6,2,7,5,2,5,7,7), 
col2 = c(7,2,5,21,6,9,4,2))

The first two columns in the two data frames are the same. I want to calculate correlation between the columns with the same name (i.e. correlation between col1 in df1 and col1 in df2).

Expected outcome:

group1 group2 col1 correlation col2 correlation
A 1 0.1 0.5
A 2 0.05 0.04
B 1 0.46 0.2

The code below should do the work. Yet, since there are more than two columns to be correlated in the actual data frame. It's quite painful to type all those column names.. Is there any smart way to do this? Thanks in advance!

df <- data.frame(df1,df2) %>% group_by(df1.group1,df2.group2) 
%>% mutate(col1_cor = cor(df1.col1,df2.col1), col2_cor = cor(df1.col2,df2.col2)) %>% 
select(df1.group1,df1.group2,col1_cor,col2_cor)

Upvotes: 0

Views: 835

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389275

You can row bind the two dataframes with an id variable to differentiate between them and calculate correlation for every col column.

library(dplyr)

bind_rows(df1, df2, .id = 'id') %>%
  group_by(group1, group2) %>%
  summarise(across(starts_with('col'), 
       ~cor(.x[id == 1], .x[id == 2]), .names = '{col}_cor'), .groups = 'drop')

Upvotes: 2

Related Questions