harv
harv

Reputation: 23

Left joining data.tables on matching or NA columns

I have a number of tables that need to be joined on to. However, in certain cells, the value is NA which needs to match onto every possible value.

In SQL it might be like:

SELECT * FROM A
LEFT JOIN B
ON (A.KEY1 = B.KEY1 OR B.KEY1 IS NULL)
AND (A.KEY2 = B.KEY2 OR B.KEY2 IS NULL) # Repeated for every other column

I could work around this by doing many joins such as:

B[A, on = .(Key1, Key2, Key3), Var = i.Var]
B[A[is.na(Key2), ], on = .(Key1, Key3), Var = i.Var]
B[A[is.na(Key3), ], on = .(Key1, Key2), Var = i.Var]
B[A[is.na(Key2) & is.na(Key3), ], on = .(Key1), Var = i.Var]
B[A[is.na(Key1), ], on = .(Key2, Key3), Var = i.Var]
B[A[is.na(Key1) & is.na(Key2), ], on = .(Key3), Var = i.Var]
B[A[is.na(Key1) & is.na(Key3), ], on = .(Key2), Var = i.Var]

However this doesn't seem like the best method, especially as the number of columns increases. The above required 7 updating joins for just 3 columns.

For example, if I had one table that matched a description of someone (city they live in, hair colour, height) with a name:

Observed data:

a <- data.table(id = c(1, 2, 3),
            city = c("city1", "city2", "city2"),
            height = c("tall", "tall", "short"),
            hair = c("black", "black", "blonde"))
       id  city height   hair    name
    1:  1 city1   tall  black    dave
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

Table to match to:

b <- data.table(city = c("city1", "city1", "city2", "city2"),
            height = c("tall", "tall", "short", "tall"),
            hair = c("black", "blonde", "blonde", "black"),
            name = c("dave", "harry", "jack", "william"))
    city height   hair    name
1: city1   tall  black    dave
2: city1   tall blonde   harry
3: city2  short blonde    jack
4: city2   tall  black william

Joining them:

b[a, on = .(city, height, hair), .(id, city, height, hair, name)]
       id  city height   hair    name
    1:  1 city1   tall  black    dave
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

This above is as expected. I need it so that if certain fields were missing such as:

        city height   hair    name
    1: city1     NA  black    dave
    2: city1     NA blonde   harry
    3: city2  short     NA    jack
    4: city2   tall  black william

It should still produce the same output

Is there any way to do this within the data.table framework efficiently?

Thanks

EDIT:

To make it clearer, if table b was

    b <- data.table(city = c("city1", "city1", "city2", "city2"),
                    height = c(NA, "tall", "short", "tall"),
                    hair = c("black", "blonde", "blonde", "black"),
                    name = c("dave", "harry", "jack", "william"))

Then the join produces only:

       id  city height   hair    name
    1:  1 city1   tall  black      NA
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

When it should produce:

       id  city height   hair    name
    1:  1 city1   tall  black    dave
    2:  2 city2   tall  black william
    3:  3 city2  short blonde    jack

The NAs should be treated as "wildcards" where they will match to any value.

EDIT2:

A second work around I've found is by cartesian joining the tables first:

    ab <- a[, as.list(b), by = .(id, i.city = city, i.height = height, i.hair)]

       id i.city i.height i.hair  city height   hair    NAME
     1:  1  city1     tall  black city1     NA  black    dave
     2:  1  city1     tall  black city1   tall blonde   harry
     3:  1  city1     tall  black city2  short blonde    jack
     4:  1  city1     tall  black city2   tall  black william
     5:  2  city2     tall  black city1     NA  black    dave
     6:  2  city2     tall  black city1   tall blonde   harry
     7:  2  city2     tall  black city2  short blonde    jack
     8:  2  city2     tall  black city2   tall  black william
     9:  3  city2    short blonde city1     NA  black    dave
    10:  3  city2    short blonde city1   tall blonde   harry
    11:  3  city2    short blonde city2  short blonde    jack
    12:  3  city2    short blonde city2   tall  black william

And then applying my conditions after:

    ab[(i.city == city | is.na(city)) 
       & (i.height == height | is.na(height)) 
       & (i.hair == hair | is.na(hair))]

     id i.city i.height i.hair  city height   hair    name
    1:  1  city1     tall  black city1     NA  black    dave
    2:  2  city2     tall  black city2   tall  black william
    3:  3  city2    short blonde city2  short blonde    jack

I'm not sure if cartesian joining like this is the best way either though when using large datasets.

Upvotes: 2

Views: 119

Answers (1)

Alexis
Alexis

Reputation: 5059

The least inefficient way I could think of was to simply expand B so that you can do a normal join afterwards.

library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

a_unique <- a[, lapply(.SD, function(x) { list(unique(x)) })]

b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", NA),
                hair = c("black", NA, "blonde", NA),
                name = c("dave", "harry", "jack", "william"))

harmonize <- function(mat) {
  ans <- as.vector(t(mat))
  ans[!is.na(ans)]
}

expand_recursively <- function(dt, cols) {
  if (length(cols) == 0L) return(dt)

  current <- cols[1L]
  next_cols <- cols[-1L]
  not_current <- setdiff(names(dt), current)

  na_class <- class(a_unique[[current]][[1L]])
  expanded <- data.table(as(NA, na_class), all = a_unique[[current]][[1L]])
  setnames(expanded, c(current, "all"))

  next_dt <- expanded[dt,
                      c(list(harmonize(as.matrix(.SD))), mget(not_current)),
                      on = current,
                      .SDcols = c(current, "all"),
                      allow = TRUE]

  setnames(next_dt, "V1", current)
  expand_recursively(next_dt, next_cols)
}

b_expanded <- expand_recursively(b, intersect(names(a), names(b)))
setcolorder(b_expanded, names(b))

b
    city height   hair    name
1: city1   <NA>  black    dave
2: city1   tall   <NA>   harry
3: city2  short blonde    jack
4: city2   <NA>   <NA> william

b_expanded
    city height   hair    name
1: city1   tall  black    dave
2: city1  short  black    dave
3: city1   tall  black   harry
4: city1   tall blonde   harry
5: city2  short blonde    jack
6: city2   tall  black william
7: city2   tall blonde william
8: city2  short  black william
9: city2  short blonde william

I think a problematic thing may be computing a_unique. If you know the possible values that can be used to match, maybe you can specify them directly inside expand_recursively.

Upvotes: 1

Related Questions