Ursus Frost

Reputation: 405

Merge R data frame or data table and overwrite values of multiple columns

How do you merge two data tables (or data frames) in R, keeping the non-NA value from each matching column? The question "Merge data frames and overwrite values" provides a solution, but as far as I can tell only if each column is specified explicitly. I have over 40 columns in common between the two data tables, and which of the two holds an NA versus a valid value is essentially random, so writing an ifelse statement for each of the 40 columns seems inefficient.

Below is a simple example, where I'd like to join (merge) the two data.tables by the id and date columns:

dt_1 <- data.table::data.table(id = "abc",
                               date = "2018-01-01",
                               a = 3, 
                               b = NA_real_,
                               c = 4, 
                               d = 6,
                               e = NA_real_)
data.table::setkey(dt_1, id, date)

> dt_1
    id       date a  b c d  e
1: abc 2018-01-01 3 NA 4 6 NA

dt_2 <- data.table::data.table(id = "abc", 
                               date = "2018-01-01",
                               a = 3, 
                               b = 5,
                               c = NA_real_, 
                               d = 6,
                               e = NA_real_)
data.table::setkey(dt_2, id, date)
> dt_2
    id       date a b  c d  e
1: abc 2018-01-01 3 5 NA 6 NA

Here is my desired output:

> dt_out
    id       date a b c d  e
1: abc 2018-01-01 3 5 4 6 NA

I've also tried the dplyr::anti_join solution from the question "left_join two data frames and overwrite", without success.

Upvotes: 0

Views: 1871

Answers (3)

moodymudskipper

Reputation: 47340

We can use {powerjoin}, do a left join and deal with the conflicts using coalesce_xy() (which is pretty much dplyr::coalesce()).

library(powerjoin)
power_left_join(dt_1, dt_2, by = c("id", "date"), conflict = coalesce_xy)
#    id       date a b c d  e
# 1 abc 2018-01-01 3 5 4 6 NA

Upvotes: 1

Hamed

Reputation: 228

You can do this by using dplyr::coalesce, which will return the first non-missing value from vectors.
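As a small illustration of that behaviour, here is a sketch on two plain vectors. The vectors x and y are made up for this example; they mirror columns a to e of the single row in dt_1 and dt_2.

```r
library(dplyr)

# Hypothetical vectors mirroring columns a..e of dt_1 and dt_2
x <- c(3, NA, 4, 6, NA)
y <- c(3, 5, NA, 6, NA)

# For each position, take the value from x unless it is NA, then fall back to y
res <- coalesce(x, y)
res
# [1]  3  5  4  6 NA
```

A position stays NA only when it is missing in both inputs.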

(EDIT: you can also use dplyr::coalesce directly on the data frames, so there is no need to create the function below. I've left it here for completeness, as a record of the original answer.)

Credit where it's due: this code is mostly from this blog post. It builds a function that takes two data frames and does what you need (taking values from the x data frame if they are present).

coalesce_join <- function(x, 
                          y, 
                          by, 
                          suffix = c(".x", ".y"), 
                          join = dplyr::full_join, ...) {
    joined <- join(x, y, by = by, suffix = suffix, ...)
    # names of desired output
    cols <- union(names(x), names(y))

    to_coalesce <- names(joined)[!names(joined) %in% cols]
    suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
    # remove suffixes and deduplicate
    to_coalesce <- unique(substr(
        to_coalesce, 
        1, 
        nchar(to_coalesce) - nchar(suffix_used)
    ))

    coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
        joined[[paste0(.x, suffix[1])]], 
        joined[[paste0(.x, suffix[2])]]
    ))
    names(coalesced) <- to_coalesce

    dplyr::bind_cols(joined, coalesced)[cols]
}

Upvotes: 1

Frank

Reputation: 66819

I'd probably put the data in long form and drop dupes:

k = key(dt_1)
DTList = list(dt_1, dt_2)

# stack both tables in long form, sort so NAs come last, keep the first row per group
DTLong = rbindlist(lapply(DTList, function(x) melt(x, id.vars = k)))
setorder(DTLong, na.last = TRUE)
unique(DTLong, by = c(k, "variable"))

    id       date variable value
1: abc 2018-01-01        a     3
2: abc 2018-01-01        b     5
3: abc 2018-01-01        c     4
4: abc 2018-01-01        d     6
5: abc 2018-01-01        e    NA
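If the wide layout from the question is needed, the long result can be reshaped back. The following is only a sketch of one way to do that: the dcast() step and the name dt_out are additions for illustration, not part of the approach above, and it assumes all value columns share one type (numeric here).

```r
library(data.table)

dt_1 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = NA_real_, c = 4, d = 6, e = NA_real_)
dt_2 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = 5, c = NA_real_, d = 6, e = NA_real_)

k <- c("id", "date")

# Stack both tables in long form: one row per (id, date, variable) and source table
DTLong <- rbindlist(lapply(list(dt_1, dt_2), melt, id.vars = k))

# Sort so non-NA values come first within each group, then keep the first row
setorderv(DTLong, c(k, "variable", "value"), na.last = TRUE)
DTLong <- unique(DTLong, by = c(k, "variable"))

# Reshape back to one column per variable (assumed extra step)
dt_out <- dcast(DTLong, id + date ~ variable, value.var = "value")
dt_out
```

With mixed column types, melt() would coerce the values to a common type, so this round trip is best suited to tables whose non-key columns are all numeric.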

Upvotes: 2
