full_join in R with NA as wildcard to match any value in the other data.frame

Question

I'm looking for a way to make dplyr'r full_join treat NA in the "by" columns as wildcards for joining, i.e. that a NA will match any value in the corresponding column of the other data frame and join the corresponding rows.

EXAMPLE: I have several data.frames in R, defining certain product characteristics, some of which depend on finer-grained features, others only on the high-level product class:

x = data.frame(class = c("A", "B", "B", "C"), flags = c(NA, "individual", "group", NA), interest = c(0.04, 0.03, 0.02, 0.05));x
y = data.frame(class = c("A", "A", "B", "B", "C"), flags = c("individual", "group", "individual", "group", NA), costs = c(0.05, 0.025, 0.03, 0.02, 0.01));y

> x
  class      flags interest
1     A            0.04
2     B individual     0.03
3     B      group     0.02
4     C            0.05
> y
  class      flags costs
1     A individual 0.050
2     A      group 0.025
3     B individual 0.030
4     B      group 0.020
5     C        0.010

Now I want to join those two to create one table of product characteristics, but I want the NA in the 'flags' column to match any value in the flags column of the other data.frame, resulting in the following joint data.frame:

> expected
  class      flags interest costs
1     A individual     0.04 0.050
2     A      group     0.04 0.025
3     B individual     0.03 0.030
4     B      group     0.02 0.020
5     C            0.05 0.010

Unfortunately, full_join only matches by exact string match, so the in the flags column will not match any of the values ("group", "partner", "individual") of the flags column of the other data frame:

> actual = full_join(x,y, by=c("class","flags"));actual
  class      flags interest costs
1     A            0.04    NA
2     B individual     0.03 0.030
3     B      group     0.02 0.020
4     C            0.05 0.010
5     A individual       NA 0.050
6     A      group       NA 0.025

How can I make full_join treat NA as matching any string (rather than matching only other NAs)?

Gregor Thomas · Accepted Answer

You can join only on the class column and then filter:

library(dplyr)
x |>
  full_join(y, by = "class", relationship = "many-to-many") |>
  filter(flags.x == flags.y | is.na(flags.x) | is.na(flags.y)) |>
  mutate(flags = coalesce(flags.x, flags.y)) |>
  select(class, flags, interest, costs)
#   class      flags interest costs
# 1     A individual     0.04 0.050
# 2     A      group     0.04 0.025
# 3     B individual     0.03 0.030
# 4     B      group     0.02 0.020
# 5     C            0.05 0.010

full_join in R with NA as wildcard to match any value in the other data.frame

Answers (2)

Related Questions