Reputation: 23
I have a number of tables that need to be joined together. However, in certain cells the value is NA, and an NA needs to match every possible value.
In SQL it might be like:
SELECT * FROM A
LEFT JOIN B
  ON (A.KEY1 = B.KEY1 OR B.KEY1 IS NULL)
 AND (A.KEY2 = B.KEY2 OR B.KEY2 IS NULL) -- repeated for every other key column
I could work around this by doing many updating joins, such as:
A[B, on = .(Key1, Key2, Key3), Var := i.Var]
A[B[is.na(Key2)], on = .(Key1, Key3), Var := i.Var]
A[B[is.na(Key3)], on = .(Key1, Key2), Var := i.Var]
A[B[is.na(Key2) & is.na(Key3)], on = .(Key1), Var := i.Var]
A[B[is.na(Key1)], on = .(Key2, Key3), Var := i.Var]
A[B[is.na(Key1) & is.na(Key2)], on = .(Key3), Var := i.Var]
A[B[is.na(Key1) & is.na(Key3)], on = .(Key2), Var := i.Var]
However, this doesn't seem like the best method, especially as the number of key columns increases: just 3 columns already require 2^3 - 1 = 7 updating joins.
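At least the per-subset joins can be generated programmatically rather than written out by hand. Below is a minimal sketch with hypothetical toy tables A and B (the names Key1..Key3 and Var are mine); it runs the least-specific joins first so that more specific matches overwrite wildcard matches:

```r
library(data.table)

# Toy tables: NA keys in B act as wildcards
A <- data.table(Key1 = c("a", "a", "b"), Key2 = c("x", "y", "y"),
                Key3 = c("p", "q", "p"))
B <- data.table(Key1 = c("a", NA), Key2 = c(NA, "y"),
                Key3 = c("p", "p"), Var = c(1, 2))

keys <- c("Key1", "Key2", "Key3")

# One updating join per non-empty subset of keys (7 for 3 columns);
# B rows that are NA in exactly the complement of the subset join on the subset
for (n in seq_along(keys)) {
  for (on_keys in combn(keys, n, simplify = FALSE)) {
    na_keys <- setdiff(keys, on_keys)
    idx <- Reduce(`&`, c(list(rep(TRUE, nrow(B))),
                         lapply(na_keys, function(k) is.na(B[[k]])),
                         lapply(on_keys, function(k) !is.na(B[[k]]))))
    if (any(idx)) A[B[idx], on = on_keys, Var := i.Var]
  }
}
```

This keeps the code size constant as the number of key columns grows, but it still performs 2^n - 1 joins.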
For example, suppose I had one table that matches a description of someone (the city they live in, hair colour, height) to a name:
Observed data:
a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

   id  city height   hair
1:  1 city1   tall  black
2:  2 city2   tall  black
3:  3 city2  short blonde
Table to match to:
b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c("tall", "tall", "short", "tall"),
                hair = c("black", "blonde", "blonde", "black"),
                name = c("dave", "harry", "jack", "william"))

    city height   hair    name
1: city1   tall  black    dave
2: city1   tall blonde   harry
3: city2  short blonde    jack
4: city2   tall  black william
Joining them:
b[a, on = .(city, height, hair), .(id, city, height, hair, name)]

   id  city height   hair    name
1:  1 city1   tall  black    dave
2:  2 city2   tall  black william
3:  3 city2  short blonde    jack
The above is as expected. I need the same result even if certain fields in b were missing, such as:
    city height   hair    name
1: city1   <NA>  black    dave
2: city1   <NA> blonde   harry
3: city2  short   <NA>    jack
4: city2   tall  black william
It should still produce the same output.
Is there an efficient way to do this within the data.table framework?
Thanks
EDIT:
To make it clearer: if table b were
b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", "tall"),
                hair = c("black", "blonde", "blonde", "black"),
                name = c("dave", "harry", "jack", "william"))
Then the join produces only:
   id  city height   hair    name
1:  1 city1   tall  black    <NA>
2:  2 city2   tall  black william
3:  3 city2  short blonde    jack
When it should produce:
   id  city height   hair    name
1:  1 city1   tall  black    dave
2:  2 city2   tall  black william
3:  3 city2  short blonde    jack
The NAs should be treated as wildcards that match any value.
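To pin the semantics down, the wildcard rule can be written as a brute-force reference check: a row of b matches a row of a when every key column in b is either NA or equal to a's value. This sketch is quadratic in the row counts and is only meant to make the rule explicit (the helper name match_one is mine, not data.table's):

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))
b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", "tall"),
                hair = c("black", "blonde", "blonde", "black"),
                name = c("dave", "harry", "jack", "william"))

keys <- c("city", "height", "hair")

# A b row matches an a row when every key is NA in b or equal to a's value
match_one <- function(av, bv) all(is.na(bv) | av == bv)

a[, name := sapply(seq_len(nrow(a)), function(i) {
  av <- unlist(a[i, keys, with = FALSE])
  hits <- which(sapply(seq_len(nrow(b)), function(j) {
    match_one(av, unlist(b[j, keys, with = FALSE]))
  }))
  if (length(hits)) b$name[hits[1L]] else NA_character_
})]
```

Any efficient solution should reproduce what this naive loop computes.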
EDIT2:
A second workaround I've found is to cartesian-join the tables first:
ab <- a[, as.list(b), by = .(id, i.city = city, i.height = height, i.hair = hair)]
    id i.city i.height i.hair  city height   hair    name
 1:  1  city1     tall  black city1   <NA>  black    dave
 2:  1  city1     tall  black city1   tall blonde   harry
 3:  1  city1     tall  black city2  short blonde    jack
 4:  1  city1     tall  black city2   tall  black william
 5:  2  city2     tall  black city1   <NA>  black    dave
 6:  2  city2     tall  black city1   tall blonde   harry
 7:  2  city2     tall  black city2  short blonde    jack
 8:  2  city2     tall  black city2   tall  black william
 9:  3  city2    short blonde city1   <NA>  black    dave
10:  3  city2    short blonde city1   tall blonde   harry
11:  3  city2    short blonde city2  short blonde    jack
12:  3  city2    short blonde city2   tall  black william
And then applying my conditions afterwards:
ab[(i.city == city | is.na(city)) &
   (i.height == height | is.na(height)) &
   (i.hair == hair | is.na(hair))]
   id i.city i.height i.hair  city height   hair    name
1:  1  city1     tall  black city1   <NA>  black    dave
2:  2  city2     tall  black city2   tall  black william
3:  3  city2    short blonde city2  short blonde    jack
I'm not sure a cartesian join like this is the best approach either, though, especially on large datasets.
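For what it's worth, the hand-written filter above can be generated from the key names with Reduce, so the code no longer grows with the number of key columns (the full cartesian product is still materialised first, though). A sketch reproducing EDIT2's setup:

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))
b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", "tall"),
                hair = c("black", "blonde", "blonde", "black"),
                name = c("dave", "harry", "jack", "william"))

# Cartesian product as in EDIT2
ab <- a[, as.list(b), by = .(id, i.city = city, i.height = height, i.hair = hair)]

keys <- c("city", "height", "hair")

# One equal-or-NA condition per key, ANDed together
cond <- Reduce(`&`, lapply(keys, function(k)
  ab[[paste0("i.", k)]] == ab[[k]] | is.na(ab[[k]])))
result <- ab[cond]
```

The memory cost remains nrow(a) * nrow(b) rows before filtering, which is the real limitation on large data.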
Upvotes: 2
Views: 119
Reputation: 5059
The least inefficient way I could think of was to simply expand B so that you can do a normal join afterwards.
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

a_unique <- a[, lapply(.SD, function(x) { list(unique(x)) })]

b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", NA),
                hair = c("black", NA, "blonde", NA),
                name = c("dave", "harry", "jack", "william"))
harmonize <- function(mat) {
  # interleave each original value with its expansions and drop the NAs
  ans <- as.vector(t(mat))
  ans[!is.na(ans)]
}

expand_recursively <- function(dt, cols) {
  if (length(cols) == 0L) return(dt)
  current <- cols[1L]
  next_cols <- cols[-1L]
  not_current <- setdiff(names(dt), current)
  na_class <- class(a_unique[[current]][[1L]])
  # a single NA key that joins to every observed value of the current column
  expanded <- data.table(as(NA, na_class), all = a_unique[[current]][[1L]])
  setnames(expanded, c(current, "all"))
  next_dt <- expanded[dt,
                      c(list(harmonize(as.matrix(.SD))), mget(not_current)),
                      on = current,
                      .SDcols = c(current, "all"),
                      allow.cartesian = TRUE]
  setnames(next_dt, "V1", current)
  expand_recursively(next_dt, next_cols)
}
b_expanded <- expand_recursively(b, intersect(names(a), names(b)))
setcolorder(b_expanded, names(b))
b
    city height   hair    name
1: city1   <NA>  black    dave
2: city1   tall   <NA>   harry
3: city2  short blonde    jack
4: city2   <NA>   <NA> william
b_expanded
    city height   hair    name
1: city1   tall  black    dave
2: city1  short  black    dave
3: city1   tall  black   harry
4: city1   tall blonde   harry
5: city2  short blonde    jack
6: city2   tall  black william
7: city2   tall blonde william
8: city2  short  black william
9: city2  short blonde william
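With b expanded like this, the ordinary equi-join from the question works. A minimal sketch, with b_expanded hard-coded to the table printed above so it runs standalone; note the expansion can create ties (both dave and harry expand to city1/tall/black here), so mult = "first" keeps one match per row of a:

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))

# b_expanded as printed above, hard-coded so this snippet is standalone
b_expanded <- data.table(
  city   = c("city1", "city1", "city1", "city1", "city2",
             "city2", "city2", "city2", "city2"),
  height = c("tall", "short", "tall", "tall", "short",
             "tall", "tall", "short", "short"),
  hair   = c("black", "black", "black", "blonde", "blonde",
             "black", "blonde", "black", "blonde"),
  name   = c("dave", "dave", "harry", "harry", "jack",
             "william", "william", "william", "william"))

# The normal join from the question, now against the expanded table
res <- b_expanded[a, on = .(city, height, hair),
                  .(id, city, height, hair, name), mult = "first"]
```

How ties should be resolved (first match, error, or all matches) is a policy decision the expansion approach leaves to the final join.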
I think a problematic thing may be computing a_unique. If you know the possible values that can be used to match, maybe you can specify them directly inside expand_recursively.
Upvotes: 1