Reputation: 18585
I working with two data frames corresponding to the sample below:
# Data sets
set.seed(1)
dta_a <- data.frame(some_value = runif(n = 10),
identifier=c("A0001","A0002","A0003","A0004","A0005",
"A0006","B0001","B0002","B0003","B0004"),
other_val = runif(n = 10))
dta_b <- data.frame(variable_abc = runif(n = 6),
identifier=c("A0001","A0002","A0003,A0004,A0005,C0001",
"B0001,B0002","B0003","B0004"),
variable_df = runif(n = 6))
I would like to merge those two data frames and obtain a data frame similar to the one presented below:
The resulting data frame would have the following qualities:
merge
command performs with all.y = TRUE
and all.x = FALSE
assuming that y
is dta_b
.dta_a
is taken with the remaining values ignored. If there is no match on the first identifier (A0003
) I would like for the command to attempt to match the next one (A0004
).I made a reference to the merge
command but, naturally, dplyr
and other solutions are fine.
Upvotes: 2
Views: 85
Reputation: 1481
you can 'melt' the dta_b so to have one row per identifier with a preference order and then join all the identifiers:
library(dplyr)
library(tidyr)
melt_dta_b = lapply(1:nrow(dta_b), function(i){
split_identifier = strsplit(as.character(dta_b$identifier[i]), split = ",", fixed = TRUE)[[1]]
data_frame(identifier = split_identifier,
original_identifier = dta_b$identifier[i], original_row = i, preference = 1:length(identifier),
variable_abc = dta_b$variable_abc[i], variable_df = dta_b$variable_df[i])
})
melt_dta_b = rbind_all(melt_dta_b)
At that point you can select only the one with the highest preference score:
joined_df = left_join(melt_dta_b, dta_a) %>%
filter(!is.na(some_value)) %>%
group_by(original_row) %>%
filter(preference == min(preference)) %>%
ungroup()
UPDATE
in order to not explicitly call the variables by name you can use the following code that binds all the 'unused' columns of the orginal df:
melt_dta_b = lapply(1:nrow(dta_b), function(i){
tmp = dta_b[i,]
split_identifier = strsplit(as.character(tmp$identifier), split = ",", fixed = TRUE)[[1]]
colnames(tmp)[2] = "original_identifier"
data_frame(identifier = split_identifier, original_row = i, preference = 1:length(identifier)) %>%
cbind(tmp)
})
melt_dta_b = rbind_all(melt_dta_b)
Upvotes: 3
Reputation: 869
Just one way of doing it, but not best way I guess. Just made a try. Split the identifiers and merge according to the first one.
dta_a$identifier = as.vector(dta_a$identifier)
dta_a1 = data.frame(dta_a, identifier_split = do.call(rbind, strsplit(dta_a$identifier, split = ",", fixed = T)))
dta_b$identifier = as.vector(dta_b$identifier)
dta_b1 = data.frame(dta_b, identifier_split = do.call(rbind, strsplit(dta_b$identifier, split = ",", fixed = T)))
dta_join = merge(dta_a1, dta_b1, by = "identifier_split.1", all.x = F, all.y = T)
In cases you don't have a match for the first one, you'll see NAs and you can subset them and merge with second ones ("identifier_split.2")
Upvotes: 1