Reputation:
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
# > a
# id t x
# 1: 1 1 11
# 2: 1 2 12
# 3: 2 1 13
# 4: 3 2 14
# 5: NA NA 15
# > b
# id y
# 1: 1 11
# 2: 2 15
a[b, on=.(id), sum(x), by = .(id)]
# > a[b, on=.(id), sum(x), by = .(id)]
# id V1
# 1: 1 23
# 2: 1 13
Why does the above query not return id = 2, V1 = 13 in the second row? I get what I would expect using by=.EACHI
though:
a[b, on=.(id), sum(x), by = .EACHI]
# > a[b, on=.(id), sum(x), by = .EACHI]
# id V1
# 1: 1 23
# 2: 2 13
Upvotes: 2
Views: 1240
Reputation:
It seems that when doing a right join between two data.tables, we should use by=.EACHI
in the by
parameter of the join, and not use any variables from the right table (b
here), as they won't be accessible in the resulting joined table. Thats why by = .id
in the first query doesn't work.
As noted in section 3.5.3 here http://franknarf1.github.io/r-tutorial/_book/tables.html
Beware DT[i,on=,j,by=bycols]. Just to repeat: only by=.EACHI works in a join. Typing other by= values there will cause i’s columns to become unavailable
This query helped me understand the above statement a little better:
a[b, .SD, on = .(id)]
# id t x
# 1: 1 1 11
# 2: 1 2 12
# 3: 2 1 13
The columns from b
, besides id
, are not accessible in .SD
for this join.
I guess that means in a join like the above, by
must take either .EACHI
, or a column name from the left table (a
here) that is not the join variable name (as in the question above shows, id
doesn't work right, even though it is in a
too). Because using a column name from a
seems to work correctly:
a[b, sum(x), on = .(id), by = .(t)]
t V1
1: 1 24
2: 2 12
Upvotes: 4