Matching across two data frames with certain observations having multiple entries to match against

Question

I working with two data frames corresponding to the sample below:

# Data sets
set.seed(1)
dta_a <- data.frame(some_value = runif(n = 10), 
                    identifier=c("A0001","A0002","A0003","A0004","A0005",
                                 "A0006","B0001","B0002","B0003","B0004"),
                    other_val = runif(n = 10))
dta_b <- data.frame(variable_abc = runif(n = 6), 
                    identifier=c("A0001","A0002","A0003,A0004,A0005,C0001",
                                 "B0001,B0002","B0003","B0004"),
                    variable_df = runif(n = 6))

I would like to merge those two data frames and obtain a data frame similar to the one presented below:

The resulting data frame would have the following qualities:

For the observations where only one identifier is present the merge command performs with all.y = TRUE and all.x = FALSE assuming that y is dta_b.
For the observations where multiple identifiers are provided only the first matched value from the dta_a is taken with the remaining values ignored. If there is no match on the first identifier (A0003) I would like for the command to attempt to match the next one (A0004).

I made a reference to the merge command but, naturally, dplyr and other solutions are fine.

Lorenzo Rossi · Accepted Answer

you can 'melt' the dta_b so to have one row per identifier with a preference order and then join all the identifiers:

library(dplyr)
library(tidyr)
melt_dta_b = lapply(1:nrow(dta_b), function(i){
 split_identifier = strsplit(as.character(dta_b$identifier[i]), split = ",", fixed = TRUE)[[1]]
 data_frame(identifier = split_identifier, 
         original_identifier = dta_b$identifier[i], original_row = i, preference = 1:length(identifier), 
         variable_abc = dta_b$variable_abc[i], variable_df = dta_b$variable_df[i])
})
melt_dta_b = rbind_all(melt_dta_b)

At that point you can select only the one with the highest preference score:

joined_df = left_join(melt_dta_b, dta_a) %>% 
  filter(!is.na(some_value)) %>% 
  group_by(original_row) %>% 
  filter(preference == min(preference)) %>% 
  ungroup()

UPDATE

in order to not explicitly call the variables by name you can use the following code that binds all the 'unused' columns of the orginal df:

melt_dta_b = lapply(1:nrow(dta_b), function(i){
  tmp = dta_b[i,]
  split_identifier = strsplit(as.character(tmp$identifier), split = ",", fixed = TRUE)[[1]]
  colnames(tmp)[2] = "original_identifier"
  data_frame(identifier = split_identifier, original_row = i, preference = 1:length(identifier)) %>% 
    cbind(tmp)
})
melt_dta_b = rbind_all(melt_dta_b)

Matching across two data frames with certain observations having multiple entries to match against

Answers (2)

Related Questions