Reputation: 167
I want to match two similar dataframes in R. Both dfs have partially the same variables and a key (id), which contain missing values:
library(tidyverse)
df1 <- as_tibble(list(id = seq(1:6),
v1 = c(1, 0, NA, 1, 0, NA),
v2 = c(NA, NA, 0, 0, 1, NA),
v3 = c(1, 0 , 1, 1, 1, NA)))
df1
# A tibble: 6 x 4
id v1 v2 v3
<int> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 0 NA 0
3 3 NA 0 1
4 4 1 0 1
5 5 0 1 1
6 6 NA NA NA
df2 <- as_tibble(list(id = seq(1:6),
v1 = c(1, NA, 0, 1, 0, 1),
v2 = c(1, 0, 0, NA, 1, 1),
v4 = c(0, 1, 0, NA, NA, NA)))
df2
# A tibble: 6 x 4
id v1 v2 v4
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 NA 0 1
3 3 0 0 0
4 4 1 NA NA
5 5 0 1 NA
6 6 1 1 NA
I'd like to merge those into one df that looks like this:
id v1 v2 v3 v4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 0
2 2 0 0 0 1
3 3 0 0 1 0
4 4 1 0 1 NA
5 5 0 1 1 NA
6 6 1 1 NA NA
So far I've tried all kind of joins from dplyr, but somehow I am not getting it.
Upvotes: 1
Views: 241
Reputation: 2318
Here is another way, similar to @utubun's approach using bind_rows
and summarise
.
bind_rows(df1, df2) %>%
group_by(id) %>%
summarise_all(list(~mean(., na.rm = TRUE))) %>%
# convert NaN to NA (if required)
mutate_at(vars(-id), function(x) ifelse(is.nan(x), NA, x))
Upvotes: 0
Reputation: 13135
We can use dplyr::coalesce
on the shared columns after splitting using the common names. Then map "loop" through dfs with more than one column and mutate
using coalesce
library(dplyr)
library(purrr)
df1 %>% left_join(df2, by='id') %>%
split.default(gsub('.[xy]','',names(.))) %>%
map_dfc(~if(ncol(.x)==1) .x else
mutate(.x, !!sym(gsub('.x','',names(.x)[1])):=coalesce(!!!syms(names(.x))))) %>%
select(-contains('.'))
# A tibble: 6 x 5
id v1 v2 v3 v4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 0
2 2 0 0 0 1
3 3 0 0 1 0
4 4 1 0 1 NA
5 5 0 1 1 NA
6 6 1 1 NA NA
Here a simple example illustrates what sym
and syms
do to each variable contains .x and .y columns, e.g. here v1
. coalesce
support tidy dots features hence we use !!!syms
.
df_sub <- df1 %>% left_join(df2, by='id') %>% select(v1.x, v1.y)
# . represents df_sub
nm <- gsub('.x','',names(df_sub)[1])
nms <- names(df_sub)
df_sub %>% mutate(!!sym(nm) := coalesce(!!!syms(nms)))
# A tibble: 6 x 3
v1.x v1.y v1
<dbl> <dbl> <dbl>
1 1 1 1
2 0 NA 0
3 NA 0 0
4 1 1 1
5 0 0 0
6 NA 1 1
sym
and syms
these functions take strings as input and turn them into symbols, then we use !!
and !!!
to unquote them. We use :=
since we need to use a string as column name inside mutate, see my answer here. Read more about sym
, syms
, !!
and !!!
here
Upvotes: 2
Reputation: 4505
Better answer, comparing to my first one:
bind_rows(df1, df2) %>%
group_by(id) %>%
summarise_all(~ coalesce(.[1], .[2]))
## A tibble: 6 x 5
# id v1 v2 v3 v4
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 0
# 2 2 0 0 0 1
# 3 3 0 0 1 0
# 4 4 1 0 1 NA
# 5 5 0 1 1 NA
# 6 6 1 1 NA NA
Upvotes: 3