Reputation: 405
How do you merge two data tables (or data frames) in R, keeping the non-NA values from each matching column? The question Merge data frames and overwrite values provides a solution if each individual column is specified explicitly (as far as I can tell, at least). But I have over 40 common columns between the two data tables, and it is somewhat random which of the two has an NA versus a valid value. So, writing ifelse statements for 40 columns seems inefficient.
Below is a simple example, where I'd like to join (merge) the two data.tables by the id and date columns:
dt_1 <- data.table::data.table(id = "abc",
                               date = "2018-01-01",
                               a = 3,
                               b = NA_real_,
                               c = 4,
                               d = 6,
                               e = NA_real_)
setkey(dt_1, id, date)
> dt_1
id date a b c d e
1: abc 2018-01-01 3 NA 4 6 NA
dt_2 <- data.table::data.table(id = "abc",
                               date = "2018-01-01",
                               a = 3,
                               b = 5,
                               c = NA_real_,
                               d = 6,
                               e = NA_real_)
setkey(dt_2, id, date)
> dt_2
id date a b c d e
1: abc 2018-01-01 3 5 NA 6 NA
Here is my desired output:
> dt_out
id date a b c d e
1: abc 2018-01-01 3 5 4 6 NA
I've also tried the dplyr::anti_join solution from left_join two data frames and overwrite, without success.
Upvotes: 0
Views: 1871
Reputation: 47340
We can use {powerjoin}: do a left join and handle the conflicts with coalesce_xy() (which is pretty much dplyr::coalesce()).
library(powerjoin)
power_left_join(dt_1, dt_2, by = "id", conflict = coalesce_xy)
# id date a b c d e
# 1 abc 2018-01-01 3 5 4 6 NA
Upvotes: 1
Reputation: 228
You can do this by using dplyr::coalesce, which returns the first non-missing value at each position across the vectors it is given.
(EDIT: you can use dplyr::coalesce directly on the data frames as well; there is no need to create the function below. I've left it there for completeness, as a record of the original answer.)
Credit where it's due: this code is mostly from this blog post. It builds a function that takes two data frames and does what you need (preferring values from the x data frame when both are present).
coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))
  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))
  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce
  dplyr::bind_cols(joined, coalesced)[cols]
}
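A sketch of calling the function on the question's example tables (assuming dplyr and purrr are installed; the expected values follow the question's desired output):

```r
library(data.table)

# Rebuild the question's two tables
dt_1 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = NA_real_, c = 4, d = 6, e = NA_real_)
dt_2 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = 5, c = NA_real_, d = 6, e = NA_real_)

# Join on the two key columns; conflicting a..e columns get coalesced
dt_out <- coalesce_join(dt_1, dt_2, by = c("id", "date"))
dt_out
# one row: a = 3, b = 5, c = 4, d = 6, e = NA
```

Passing by = c("id", "date") ensures only the 40-odd value columns receive the .x/.y suffixes that the function coalesces away.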
Upvotes: 1
Reputation: 66819
I'd probably put the data in long form and drop dupes:
k = key(dt_1)
DTList = list(dt_1, dt_2)
DTLong = rbindlist(lapply(DTList, function(x) melt(x, id=k)))
setorder(DTLong, na.last = TRUE)
unique(DTLong, by=c(k, "variable"))
id date variable value
1: abc 2018-01-01 a 3
2: abc 2018-01-01 b 5
3: abc 2018-01-01 c 4
4: abc 2018-01-01 d 6
5: abc 2018-01-01 e NA
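If you then want the result back in wide form, one sketch is to dcast the deduplicated long table (self-contained, rebuilding the example tables; the dt_out name is my own):

```r
library(data.table)

dt_1 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = NA_real_, c = 4, d = 6, e = NA_real_)
dt_2 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = 5, c = NA_real_, d = 6, e = NA_real_)
setkey(dt_1, id, date)
setkey(dt_2, id, date)

k <- key(dt_1)
# Stack both tables in long form, one row per (id, date, variable)
DTLong <- rbindlist(lapply(list(dt_1, dt_2), melt, id = k))
# Sort with NAs last so unique() keeps the non-NA value per group
setorder(DTLong, na.last = TRUE)
# Deduplicate, then cast back to one column per variable
dt_out <- dcast(unique(DTLong, by = c(k, "variable")),
                id + date ~ variable, value.var = "value")
dt_out
# one row: a = 3, b = 5, c = 4, d = 6, e = NA
```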
Upvotes: 2