Ssank
Ssank

Reputation: 3677

merging R dataframes on mulitple columns

I have a R dataframe df_big

Candidate  Status
A           1
B           10
C           12
D           15
E           25

and so on

I have a second dataframe df_small

Candidate_1    Candidate_2
A                C   
B                E    
C                D

I want to merge df_small and df_big to get df_final which looks like

Candidate_1    Candidate_2       Status_1     Status_2
A                C                  1           12   
B                E                  10          25
C                D                  12           15

I tried something to the effect

df_small_1 = merge(x=df_small,y = df_big,by.x = "Candidate_1",by.y="Candidate") 

df_small_2 = merge(x=df_small,y = df_big,by.x = "Candidate_2",by.y="Candidate") 

but I do not know how to combine df_small_1 and df_small_2 to df_small

Upvotes: 0

Views: 74

Answers (2)

discipulus
discipulus

Reputation: 2725

Merging is an expensive operation. You could better do this without the need for merge operation using a combination of which and indexing. I have benchmarked on the merge and non-merge solution. The answer also gives the order of columns exactly as needed.

doit <- function(df_small, df_big) 
{

  # Which elements do we need to copy
  indx1 <- df_big[["Candidate"]]  %in% df_small[["Candidate_1"]]

  indx2 <- df_big[["Candidate"]]  %in% df_small[["Candidate_2"]]

  # Copy them
  df_needed <- data.frame(Candiate_1 = df_big[indx1, "Candidate"], Candiate_2 = df_big[indx2, "Candidate"],
                          Status_1 = df_big[indx1, "Status"], Status_2 = df_big[indx2, "Status"])

}

#merge two times
doit_merge <- function(df_small, df_big) 
{
  df_result <- merge(x=df_small,  y=df_big, by.x="Candidate_1", by.y="Candidate") 
  df_result <- merge(x=df_result, y=df_big, by.x="Candidate_2", by.y="Candidate") 
}

library(microbenchmark)

# benchmark results
microbenchmark(
  doit(df_small, df_big) ,
  doit_merge(df_small, df_big) 
)

RESULTS

Unit: microseconds
expr                              min       lq     mean    median        uq      max   neval cld
doit(df_small, df_big)        676.570  758.472 1077.203  834.0115  978.9315 4591.473   100    a 
doit_merge(df_small, df_big) 1329.327 1449.205 1986.995 1612.3940 2021.9070 5966.780   100    b

Upvotes: 0

Tim Biegeleisen
Tim Biegeleisen

Reputation: 522626

You need to join twice, once for each of the two candidates' status:

df_result <- merge(x=df_small,  y=df_big, by.x="Candidate_1", by.y="Candidate") 
df_result <- merge(x=df_result, y=df_big, by.x="Candidate_2", by.y="Candidate") 

Upvotes: 2

Related Questions