Fabian Werner

Reputation: 1027

data.table: cartesian join and nomatch

I want to do a cartesian (full outer) join using the fabulous data.table package in R, and I want unmatched rows to be included as well. My two data.tables "left" and "right" look like

key | data_left
  1 |       aaa
  2 |       bbb
  3 |       ccc

and

key | data_right
  1 |        xxx
  2 |        yyy

Joining on the key column "key" gives me

key | data_left | data_right
  1 |       aaa |        xxx
  2 |       bbb |        yyy

However, the unmatched row 3 | ccc is missing entirely. Setting nomatch=0 instead of the default nomatch=NA did not help. I want data.table to fill the remaining columns with NA, so I expect

key | data_left | data_right
  1 |       aaa |        xxx
  2 |       bbb |        yyy
  3 |       ccc |         NA

Any idea what I can do in order to get this to work?

Code sample:

library(data.table)
left = data.table(keyCol = c(1,2,3), data_left = c("aaa", "bbb", "ccc"))
right = data.table(keyCol = c(1,2), data_right = c("xxx", "yyy"))
setkey(left, keyCol)
setkey(right, keyCol)
resNA = left[right, allow.cartesian=TRUE, nomatch=NA]
res0 = left[right, allow.cartesian=TRUE, nomatch=0]

Upvotes: 1

Views: 1070

Answers (1)

Frank

Reputation: 66819

Assuming there is at most one row per keyCol value, I'd do...

# setup
kc = "keyCol"
DTs = list(left, right)

# make main table with key col(s)
DT = unique(rbindlist(lapply(DTs, `[`, j = ..kc)))

# get non-key cols
for (d in DTs){
  cols = setdiff(names(d), kc)
  DT[d, on=kc, (cols) := mget(sprintf("i.%s", cols)) ][]
}

# cleanup loop vars
rm(d, cols)
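
For the two-table case in the question, the result of the loop above can be cross-checked against merge() with all = TRUE, which performs a full outer join. This is only a sketch on the question's sample data, not a replacement for the loop when there are many tables:

```r
library(data.table)

# sample data copied from the question
left  = data.table(keyCol = c(1, 2, 3), data_left  = c("aaa", "bbb", "ccc"))
right = data.table(keyCol = c(1, 2),    data_right = c("xxx", "yyy"))

# all = TRUE keeps unmatched rows from both tables and fills the gaps with NA
res = merge(left, right, by = "keyCol", all = TRUE)
print(res)
```

Here the row with keyCol 3 is kept and its data_right is NA.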

This should work for more general cases with...

  • more key cols (in kc) and
  • more tables with non-overlapping col names (in DTs).
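
As a rough illustration of that more general case, here is the same loop run on three made-up tables with two key columns. The table names (a, b, cc), the key columns (id, year) and the values are invented for this example; kc with with = FALSE is used in place of ..kc purely so the snippet does not depend on where kc is defined:

```r
library(data.table)

# hypothetical example data: two key columns, three tables,
# non-overlapping value columns
kc  = c("id", "year")
a   = data.table(id = c(1, 2), year = c(2000, 2000), va = c("a1", "a2"))
b   = data.table(id = c(2, 3), year = c(2000, 2001), vb = c("b2", "b3"))
cc  = data.table(id = 1,       year = 2000,          vc = "c1")
DTs = list(a, b, cc)

# main table: every distinct key combination seen in any table
DT = unique(rbindlist(lapply(DTs, function(d) d[, kc, with = FALSE])))

# add each table's non-key columns by reference via an update join
for (d in DTs) {
  cols = setdiff(names(d), kc)
  DT[d, on = kc, (cols) := mget(sprintf("i.%s", cols))]
}
rm(d, cols)

print(DT)
```

Key combinations that are missing from a given table end up with NA in that table's columns.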

If you want the key cols as the key in the result, the code simplifies a little:

# make main table with key col(s)
DT = setkey(unique(rbindlist(lapply(DTs, `[`, j = ..kc))))

# get non-key cols
for (d in DTs){
  cols = setdiff(names(d), kc)
  DT[d, (cols) := mget(sprintf("i.%s", cols)) ][]
}
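
One more note on why the original attempt drops row 3: X[Y] returns one row per row of Y, so left[right] can only contain the keys present in right. For the exact data in the question, where every key already appears in left, swapping the tables may be enough. This is a sketch for that special case only; it is a left join, not a full outer join, so keys that exist only in right would be lost:

```r
library(data.table)

# sample data copied from the question
left  = data.table(keyCol = c(1, 2, 3), data_left  = c("aaa", "bbb", "ccc"))
right = data.table(keyCol = c(1, 2),    data_right = c("xxx", "yyy"))

# look up each row of left in right; unmatched rows get NA (default nomatch)
res = right[left, on = "keyCol"]
print(res)
```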

Upvotes: 1
