Reputation: 340
I'm looking for a way to make dplyr'r full_join treat NA in the "by" columns as wildcards for joining, i.e. that a NA will match any value in the corresponding column of the other data frame and join the corresponding rows.
EXAMPLE: I have several data.frames in R, defining certain product characteristics, some of which depend on finer-grained features, others only on the high-level product class:
x = data.frame(class = c("A", "B", "B", "C"), flags = c(NA, "individual", "group", NA), interest = c(0.04, 0.03, 0.02, 0.05));x
y = data.frame(class = c("A", "A", "B", "B", "C"), flags = c("individual", "group", "individual", "group", NA), costs = c(0.05, 0.025, 0.03, 0.02, 0.01));y
> x
class flags interest
1 A <NA> 0.04
2 B individual 0.03
3 B group 0.02
4 C <NA> 0.05
> y
class flags costs
1 A individual 0.050
2 A group 0.025
3 B individual 0.030
4 B group 0.020
5 C <NA> 0.010
Now I want to join those two to create one table of product characteristics, but I want the NA in the 'flags' column to match any value in the flags column of the other data.frame, resulting in the following joint data.frame:
> expected
class flags interest costs
1 A individual 0.04 0.050
2 A group 0.04 0.025
3 B individual 0.03 0.030
4 B group 0.02 0.020
5 C <NA> 0.05 0.010
Unfortunately, full_join only matches by exact string match, so the in the flags column will not match any of the values ("group", "partner", "individual") of the flags column of the other data frame:
> actual = full_join(x,y, by=c("class","flags"));actual
class flags interest costs
1 A <NA> 0.04 NA
2 B individual 0.03 0.030
3 B group 0.02 0.020
4 C <NA> 0.05 0.010
5 A individual NA 0.050
6 A group NA 0.025
How can I make full_join treat NA as matching any string (rather than matching only other NAs)?
Upvotes: 1
Views: 54
Reputation: 30474
I wonder if you could try a different approach by joining on non-NA
columns. For example, split
your data.frame by each row, then select the columns that are not NA
, and use those to join with the second data.frame. For class
"A" that has NA
for flags
, it will do a join based on class
only; while for class
"B", it will join on both class
and flags
.
library(tidyverse)
x %>%
split(seq(nrow(.))) %>%
map_dfr(~.x %>%
select_if(!is.na(.)) %>%
inner_join(y, .))
Output
class flags costs interest
1 A individual 0.050 0.04
2 A group 0.025 0.04
3 B individual 0.030 0.03
4 B group 0.020 0.02
5 C <NA> 0.010 0.05
Upvotes: 0
Reputation: 145755
You can join only on the class
column and then filter:
library(dplyr)
x |>
full_join(y, by = "class", relationship = "many-to-many") |>
filter(flags.x == flags.y | is.na(flags.x) | is.na(flags.y)) |>
mutate(flags = coalesce(flags.x, flags.y)) |>
select(class, flags, interest, costs)
# class flags interest costs
# 1 A individual 0.04 0.050
# 2 A group 0.04 0.025
# 3 B individual 0.03 0.030
# 4 B group 0.02 0.020
# 5 C <NA> 0.05 0.010
Upvotes: 3