MBeck
MBeck

Reputation: 167

Merging two df in r replace missing values

I want to match two similar dataframes in R. Both dfs have partially the same variables and a key (id), which contain missing values:

library(tidyverse)

df1 <- as_tibble(list(id = seq(1:6),
                      v1 = c(1, 0, NA, 1, 0, NA),
                      v2 = c(NA, NA, 0, 0, 1, NA),
                      v3 = c(1, 0 , 1, 1, 1, NA)))
df1
# A tibble: 6 x 4
     id    v1    v2    v3
  <int> <dbl> <dbl> <dbl>
1     1     1    NA     1
2     2     0    NA     0
3     3    NA     0     1
4     4     1     0     1
5     5     0     1     1
6     6    NA    NA    NA

df2 <- as_tibble(list(id = seq(1:6),
                      v1 = c(1, NA, 0, 1, 0, 1),
                      v2 = c(1, 0, 0, NA, 1, 1),
                      v4 = c(0, 1, 0, NA, NA, NA)))

df2
# A tibble: 6 x 4
     id    v1    v2    v4
  <int> <dbl> <dbl> <dbl>
1     1     1     1     0
2     2    NA     0     1
3     3     0     0     0
4     4     1    NA    NA
5     5     0     1    NA
6     6     1     1    NA

I'd like to merge those into one df that looks like this:

     id    v1    v2    v3    v4
  <int> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     0
2     2     0     0     0     1
3     3     0     0     1     0
4     4     1     0     1    NA
5     5     0     1     1    NA
6     6     1     1    NA    NA

So far I've tried all kind of joins from dplyr, but somehow I am not getting it.

Upvotes: 1

Views: 241

Answers (3)

MKa
MKa

Reputation: 2318

Here is another way, similar to @utubun's approach using bind_rows and summarise.

bind_rows(df1, df2) %>%
  group_by(id) %>%
  summarise_all(list(~mean(., na.rm = TRUE))) %>% 
  # convert NaN to NA (if required)
  mutate_at(vars(-id), function(x) ifelse(is.nan(x), NA, x))

Upvotes: 0

A. Suliman
A. Suliman

Reputation: 13135

We can use dplyr::coalesce on the shared columns after splitting using the common names. Then map "loop" through dfs with more than one column and mutate using coalesce

library(dplyr)
library(purrr)
df1 %>% left_join(df2, by='id') %>% 
        split.default(gsub('.[xy]','',names(.))) %>% 
        map_dfc(~if(ncol(.x)==1) .x else 
                mutate(.x, !!sym(gsub('.x','',names(.x)[1])):=coalesce(!!!syms(names(.x))))) %>% 
        select(-contains('.'))

# A tibble: 6 x 5
       id    v1    v2    v3    v4
     <int> <dbl> <dbl> <dbl> <dbl>
  1     1     1     1     1     0
  2     2     0     0     0     1
  3     3     0     0     1     0
  4     4     1     0     1    NA
  5     5     0     1     1    NA
  6     6     1     1    NA    NA

Here a simple example illustrates what sym and syms do to each variable contains .x and .y columns, e.g. here v1. coalesce support tidy dots features hence we use !!!syms.

df_sub <- df1 %>% left_join(df2, by='id') %>% select(v1.x, v1.y)
# . represents df_sub
nm <- gsub('.x','',names(df_sub)[1])
nms <- names(df_sub)
df_sub %>% mutate(!!sym(nm) := coalesce(!!!syms(nms)))

# A tibble: 6 x 3
  v1.x  v1.y    v1
  <dbl> <dbl> <dbl>
1     1     1     1
2     0    NA     0
3    NA     0     0
4     1     1     1
5     0     0     0
6    NA     1     1

sym and syms these functions take strings as input and turn them into symbols, then we use !! and !!! to unquote them. We use := since we need to use a string as column name inside mutate, see my answer here. Read more about sym, syms, !! and !!! here

Upvotes: 2

utubun
utubun

Reputation: 4505

Better answer, comparing to my first one:

bind_rows(df1, df2) %>%
  group_by(id) %>%
  summarise_all(~ coalesce(.[1], .[2]))

## A tibble: 6 x 5
#      id    v1    v2    v3    v4
#   <int> <dbl> <dbl> <dbl> <dbl>
# 1     1     1     1     1     0
# 2     2     0     0     0     1
# 3     3     0     0     1     0
# 4     4     1     0     1    NA
# 5     5     0     1     1    NA
# 6     6     1     1    NA    NA

Upvotes: 3

Related Questions