How to find common rows (considering vice versa format) of 2 dataframe in R

Question

I want to find the common rows between 2 dataframe. To find the common rows, I can use inner_join(), semi_join(), and merge(). I have gone through different posts including this. But, these operations are not fulfilling my purposes. Because my data in the dataframe is a little different!

Sometimes, the data in the dataframe can be vise versa. Like the 3rd and 5th rows of dataframe-1 and dataframe-2. Dataframe-1 contains A3 A1 0.75 but Dataframe-2 contains A1 A3 0.75 . I would like to take these 2 rows as the same.

My first dataframe looks like

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50
3    A3     A1   0.75
4    A4     A5   0.88
5    A5     A3   0.99
6    (+)-1(10),4-Cadinadiene     Falcarinone-10     0.09
7    Leucodelphinidin-100    (+)-1(10),4-Cadinadiene     0.876
8    Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
9    (2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
10   Falcarinone     Lignin  1
11   A1  (+)-1(10),4-Cadinadiene     1
12   A2  Lignin-10  1
13   A3  (2E,7R,11R)-2-Phyten-1-ol   1
14   Falcarinone  A6    1
15   A4  Leucodelphinidin    1
16   A4  Leucodelphinidin    1
17   Falcarinone  A100    1
18   A4  Falcarinone     1

the second dataframe looks like

  query target
1    A1     A2   
2    A2     A5   
3    A1     A3  // Missing in the output
4    A4     A5   
5    A3     A5  // Missing in the output
6    A3  (2E,7R,11R)-2-Phyten-1-ol   
7    (+)-1(10),4-Cadinadiene     Falcarinone    
8    Leucodelphinidin    (+)-1(10),4-Cadinadiene-100    
9    Lignin-2  (2E,7R,11R)-2-Phyten-1-ol   
10   A11  (+)-1(10),4-Cadinadiene    
11   A2  Lignin  
12   A3  (2E,7R,11R)-2-Phyten-1-0l 
13   Falcarinone  A60    
14   A4  Leucodelphinidin  // Missing in the output

The code I am using output <- semi_join(Dataframe-1, Dataframe-2) OR output <- inner_join(df_only_dd, sample_data_dd_interaction)

The output I am getting

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50

But, my expected output is like this

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50
3    A3     A1   0.75
4    A4     A5   0.88
5    A5     A3   0.99
6    A4  Leucodelphinidin  1

Reproducible code is given below

df_1 <- read.table(text="query   target     weight
A1  A2  0.6
A2  A5  0.5
A3  A1  0.75
A4  A5  0.88
A5  A3  0.99
(+)-1(10),4-Cadinadiene     Falcarinone     0.09
Leucodelphinidin    (+)-1(10),4-Cadinadiene     0.876
Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
(2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
Falcarinone     Lignin  1
A1  (+)-1(10),4-Cadinadiene     1
A2  Lignin  1
A3  (2E,7R,11R)-2-Phyten-1-ol   1
Falcarinone  A6    1
A4  Leucodelphinidin    1
A4  Leucodelphinidin    1
Falcarinone  A100    1
A5  Falcarinone     1", header=TRUE)

df_2 <- read.table(text="query   target
A1  A2 
A2  A5
A1  A3  
A4  A5  
A3  A5  
(+)-1(10),4-Cadinadiene     Falcarinone    
Leucodelphinidin    (+)-1(10),4-Cadinadiene-100    
Lignin-2  (2E,7R,11R)-2-Phyten-1-ol   
A11  (+)-1(10),4-Cadinadiene    
A2  Lignin  
A3  (2E,7R,11R)-2-Phyten-1-0l 
Falcarinone  A6    
A4  Leucodelphinidin  ", header=TRUE)

Any kind of suggestion is appreciated.

jay.sf · Accepted Answer

You could write a small function that sorts rows of first two columns of both data frames, then merge them.

sc <- function(x, i) setNames(cbind(data.frame(t(apply(x[i], 1, sort))), x[-i]), names(x))

res <- merge(sc(df_1, 1:2), sc(df_2, 1:2))
res[!duplicated(res), ]  ## remove duplicates
#                      query           target weight
# 1  (+)-1(10),4-Cadinadiene      Falcarinone   0.09
# 2                       A1               A2   0.60
# 3                       A1               A3   0.75
# 4                       A2               A5   0.50
# 5                       A2           Lignin   1.00
# 6                       A3               A5   0.99
# 7                       A4               A5   0.88
# 8                       A4 Leucodelphinidin   1.00
# 10                      A6      Falcarinone   1.00

Edit

Solution with data.table which should be more memory efficient.

library(data.table)
setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
setDT(df_2)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
res <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
res
#                      query           target weight
# 1: (+)-1(10),4-Cadinadiene      Falcarinone   0.09
# 2:                      A1               A2   0.60
# 3:                      A1               A3   0.75
# 4:                      A2               A5   0.50
# 5:                      A2           Lignin   1.00
# 6:                      A3               A5   0.99
# 7:                      A4               A5   0.88
# 8:                      A4 Leucodelphinidin   1.00
# 9:                      A6      Falcarinone   1.00

To get back "data.frame"s, just do e.g. setDF(res).

How to find common rows (considering vice versa format) of 2 dataframe in R

Answers (2)

Edit

Related Questions