Strange behaviour subsetting by multiple keys using data.table

Question

I've set multiple keys in a data.table, but when I try and select rows by multiple values of the keys, it seems to return a row for each potential combination, but filled with NA for rows that do not exist.

I can get the sample code in 1c of this document, so it must be something that I'm just not seeing. Any help would be much appreciated.

library(data.table)

dt = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)

setkey(dt,colA,colB)

print(dt)
# colA colB colC
# 1:    1    A   11
# 2:    2    A   12
# 3:    3    B   13
# 4:    4    B   14

print(
  dt[.(2,"A")]
)
# As expected
# colA colB colC
# 1:    2    A   12

print(
  dt[.(c(2,3),"A")]
)
# colA colB colC
# 1:    2    A   12
# 2:    3    A   NA #Unexpected

print(
  dt[.(unique(colA),"A")]
)
# colA colB colC
# 1:    1    A   11
# 2:    2    A   12
# 3:    3    A   NA #Unexpected
# 4:    4    A   NA #Unexpected

Frank · Accepted Answer

DT[i] will look up each row of i in rows of DT. By default, unmatched rows of i are shown with NA. To instead drop unmatched rows, use nomatch = 0:

dt[.(unique(colA),"A"), nomatch=0]

#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12

The nomatch argument is covered in the vignette the OP linked. To find the latest version of the vignette, use browseVignettes("data.table").

As a side note, it is not necessary to set keys before joining any more. Instead, on= can be used:

library(data.table)
dt2 = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)

dt2[.(unique(colA),"A"), on=.(colA, colB), nomatch=0]

#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12

See Arun's answer for details on why keying is not usually necessary to improve performance in joins. It says:

Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

I typically only set keys when I am doing joins interactively, so I can skip typing out the on=.

Strange behaviour subsetting by multiple keys using data.table

Answers (1)

Related Questions