Reputation: 588
I've set multiple keys in a data.table, but when I try and select rows by multiple values of the keys, it seems to return a row for each potential combination, but filled with NA for rows that do not exist.
I can get the sample code in 1c of this document, so it must be something that I'm just not seeing. Any help would be much appreciated.
library(data.table)
dt = data.table(colA = 1:4,
colB = c("A","A","B","B"),
colC = 11:14)
setkey(dt,colA,colB)
print(dt)
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
# 3: 3 B 13
# 4: 4 B 14
print(
dt[.(2,"A")]
)
# As expected
# colA colB colC
# 1: 2 A 12
print(
dt[.(c(2,3),"A")]
)
# colA colB colC
# 1: 2 A 12
# 2: 3 A NA #Unexpected
print(
dt[.(unique(colA),"A")]
)
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
# 3: 3 A NA #Unexpected
# 4: 4 A NA #Unexpected
Upvotes: 3
Views: 40
Reputation: 66819
DT[i]
will look up each row of i
in rows of DT
. By default, unmatched rows of i
are shown with NA
. To instead drop unmatched rows, use nomatch = 0
:
dt[.(unique(colA),"A"), nomatch=0]
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
The nomatch
argument is covered in the vignette the OP linked. To find the latest version of the vignette, use browseVignettes("data.table")
.
As a side note, it is not necessary to set keys before joining any more. Instead, on=
can be used:
library(data.table)
dt2 = data.table(colA = 1:4,
colB = c("A","A","B","B"),
colC = 11:14)
dt2[.(unique(colA),"A"), on=.(colA, colB), nomatch=0]
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
See Arun's answer for details on why keying is not usually necessary to improve performance in joins. It says:
Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
I typically only set keys when I am doing joins interactively, so I can skip typing out the on=
.
Upvotes: 4