Reinhold Kainhofer
Reinhold Kainhofer

Reputation: 340

full_join in R with NA as wildcard to match any value in the other data.frame

I'm looking for a way to make dplyr'r full_join treat NA in the "by" columns as wildcards for joining, i.e. that a NA will match any value in the corresponding column of the other data frame and join the corresponding rows.

EXAMPLE: I have several data.frames in R, defining certain product characteristics, some of which depend on finer-grained features, others only on the high-level product class:

x = data.frame(class = c("A", "B", "B", "C"), flags = c(NA, "individual", "group", NA), interest = c(0.04, 0.03, 0.02, 0.05));x
y = data.frame(class = c("A", "A", "B", "B", "C"), flags = c("individual", "group", "individual", "group", NA), costs = c(0.05, 0.025, 0.03, 0.02, 0.01));y
> x
  class      flags interest
1     A       <NA>     0.04
2     B individual     0.03
3     B      group     0.02
4     C       <NA>     0.05
> y
  class      flags costs
1     A individual 0.050
2     A      group 0.025
3     B individual 0.030
4     B      group 0.020
5     C       <NA> 0.010

Now I want to join those two to create one table of product characteristics, but I want the NA in the 'flags' column to match any value in the flags column of the other data.frame, resulting in the following joint data.frame:

> expected
  class      flags interest costs
1     A individual     0.04 0.050
2     A      group     0.04 0.025
3     B individual     0.03 0.030
4     B      group     0.02 0.020
5     C       <NA>     0.05 0.010

Unfortunately, full_join only matches by exact string match, so the in the flags column will not match any of the values ("group", "partner", "individual") of the flags column of the other data frame:

> actual = full_join(x,y, by=c("class","flags"));actual
  class      flags interest costs
1     A       <NA>     0.04    NA
2     B individual     0.03 0.030
3     B      group     0.02 0.020
4     C       <NA>     0.05 0.010
5     A individual       NA 0.050
6     A      group       NA 0.025

How can I make full_join treat NA as matching any string (rather than matching only other NAs)?

Upvotes: 1

Views: 54

Answers (2)

Ben
Ben

Reputation: 30474

I wonder if you could try a different approach by joining on non-NA columns. For example, split your data.frame by each row, then select the columns that are not NA, and use those to join with the second data.frame. For class "A" that has NA for flags, it will do a join based on class only; while for class "B", it will join on both class and flags.

library(tidyverse)

x %>%
  split(seq(nrow(.))) %>%
  map_dfr(~.x %>%
             select_if(!is.na(.)) %>%
             inner_join(y, .))

Output

  class      flags costs interest
1     A individual 0.050     0.04
2     A      group 0.025     0.04
3     B individual 0.030     0.03
4     B      group 0.020     0.02
5     C       <NA> 0.010     0.05

Upvotes: 0

Gregor Thomas
Gregor Thomas

Reputation: 145755

You can join only on the class column and then filter:

library(dplyr)
x |>
  full_join(y, by = "class", relationship = "many-to-many") |>
  filter(flags.x == flags.y | is.na(flags.x) | is.na(flags.y)) |>
  mutate(flags = coalesce(flags.x, flags.y)) |>
  select(class, flags, interest, costs)
#   class      flags interest costs
# 1     A individual     0.04 0.050
# 2     A      group     0.04 0.025
# 3     B individual     0.03 0.030
# 4     B      group     0.02 0.020
# 5     C       <NA>     0.05 0.010

Upvotes: 3

Related Questions