Reputation: 28993
I have a dataframe (df1) and a list of dataframes (test) like below; I want to join df1 with each of the dataframes in test and populate a new column (X), while keeping all the other records intact.
read.table(text = "Fruits A B C D
Apple 10 1.3 NA NA
Orange 0.2 NA 0.21 NA
Grape NA 0.06 51 0.7
Grape NA 0.06 51 0.7
Grape 1 0.06 51 0.7
Grape NA 0.06 NA 0.8
Berry 11 20 0.3 0.04
Apple NA 1.1 0.5 NA
Apple NA 1.2 0.5 NA
Apple NA 1.3 0.1 NA
Berry NA NA 0.3 0.04
Berry 1 NA 0.9 0.01
Apple 1 1.3 0.5 NA
Apple 1 1.3 0.5 NA",
stringsAsFactors = FALSE, header = TRUE) -> df1
list(data.frame(Fruits = c("Apple"), A = 10, X = "oh"),
data.frame(Fruits = c("Berry"), A = 11, B = 20, X = "duh")) -> test
Here's the expected output:
Fruits A B C D X
1 Apple 10.0 1.30 NA NA oh
2 Orange 0.2 NA 0.21 NA NA
3 Grape NA 0.06 51.00 0.70 NA
4 Grape NA 0.06 51.00 0.70 NA
5 Grape 1.0 0.06 51.00 0.70 NA
6 Grape NA 0.06 NA 0.80 NA
7 Berry 11.0 20.00 0.30 0.04 duh
8 Apple NA 1.10 0.50 NA NA
9 Apple NA 1.20 0.50 NA NA
10 Apple NA 1.30 0.10 NA NA
11 Berry NA NA 0.30 0.04 NA
12 Berry 1.0 NA 0.90 0.01 NA
13 Apple 1.0 1.30 0.50 NA NA
14 Apple 1.0 1.30 0.50 NA NA
Simply looping through the dataframes within test does not work, since it creates a separate dataframe for each ..._join and also creates duplicated rows on the second iteration. Maybe we can use a conditional mutate.
purrr::map(test, ~full_join(df1, .x))
It is more than likely I am missing something simple, but I don't want to join the outputs of full_join afterwards since my actual df1 has over 1M rows.
Upvotes: 6
Views: 190
Reputation: 887291
We can use
library(powerjoin)
library(purrr)
library(dplyr)
test %>%
  reduce(power_full_join, .init = df1, conflict = coalesce_xy)
Or, as @moodymudskipper mentioned in the comments, power_full_join does the join recursively when given a list:
power_full_join(df1, test, conflict = coalesce_xy)
Upvotes: 4
Reputation: 17184
I would use purrr::reduce() instead of map(). But this raises the issue that after the first iteration, X appears in both dataframes and is treated as a key. One workaround would be to give all the X columns unique names, then coalesce after joining.
library(dplyr)
library(purrr)
test2 <- imap(test, ~ rename(.x, "X{.y}" := X))
test2 %>%
  reduce(full_join, .init = df1) %>%
  mutate(X = coalesce(X1, X2), .keep = "unused")
Fruits A B C D X
1 Apple 10.0 1.30 NA NA oh
2 Orange 0.2 NA 0.21 NA <NA>
3 Grape NA 0.06 51.00 0.70 <NA>
4 Grape NA 0.06 51.00 0.70 <NA>
5 Grape 1.0 0.06 51.00 0.70 <NA>
6 Grape NA 0.06 NA 0.80 <NA>
7 Berry 11.0 20.00 0.30 0.04 duh
8 Apple NA 1.10 0.50 NA <NA>
9 Apple NA 1.20 0.50 NA <NA>
10 Apple NA 1.30 0.10 NA <NA>
11 Berry NA NA 0.30 0.04 <NA>
12 Berry 1.0 NA 0.90 0.01 <NA>
13 Apple 1.0 1.30 0.50 NA <NA>
14 Apple 1.0 1.30 0.50 NA <NA>
If test has a lot of elements, it'll be annoying to list out coalesce(X1, X2, X3, ..., Xn). In that case, you can use this alternative:
test2 %>%
  reduce(full_join, .init = df1) %>%
  mutate(X = coalesce(!!!syms(paste0("X", seq_along(test2)))), .keep = "unused")
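If you'd rather avoid the splicing entirely, the same idea can be written by joining first and then folding the renamed columns together pairwise with coalesce(). This is just a sketch reusing test2 and df1 from above; joined and x_cols are illustrative names:
# join everything first, then collapse the X1, X2, ... columns into a single X
joined <- reduce(test2, full_join, .init = df1)
x_cols <- grep("^X[0-9]+$", names(joined), value = TRUE)
joined$X <- reduce(joined[x_cols], coalesce)  # coalesce columns left to right
joined[x_cols] <- NULL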
Upvotes: 2
Reputation: 206308
It seems like this may be a good use of the new rows_update() method. You can iterate over the list of rows to update inside a reduce() to apply them sequentially. For example:
library(dplyr)  # rows_update() comes from dplyr

purrr::reduce(test, function(data, match) {
  rows_update(data, match, by = setdiff(names(match), "X"))
}, .init = data.frame(df1, X = NA_character_))
This adds a column named X that is all NA at first; then at each iteration it updates X, matching rows on whatever columns the current element shares with df1 other than "X".
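For illustration, these are the key columns passed as by at each step (derived from the test list defined in the question):
# keys per element of `test`: c("Fruits", "A") and c("Fruits", "A", "B")
lapply(test, function(match) setdiff(names(match), "X"))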
Upvotes: 3