user9102888
user9102888

Reputation:

`by` and `.EACHI` in data.table

a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))

# > a
# id  t  x
# 1:  1  1 11
# 2:  1  2 12
# 3:  2  1 13
# 4:  3  2 14
# 5: NA NA 15

# > b
# id  y
# 1:  1 11
# 2:  2 15

a[b, on=.(id), sum(x), by = .(id)]
# > a[b, on=.(id), sum(x), by = .(id)]
# id V1
# 1:  1 23
# 2:  1 13

Why does the above query not return id = 2, V1 = 13 in the second row? I get what I would expect using by=.EACHI though:

a[b, on=.(id), sum(x), by = .EACHI]
# > a[b, on=.(id), sum(x), by = .EACHI]
# id V1
# 1:  1 23
# 2:  2 13

Upvotes: 2

Views: 1240

Answers (1)

user9102888
user9102888

Reputation:

It seems that when doing a right join between two data.tables, we should use by=.EACHI in the by parameter of the join, and not use any variables from the right table (b here), as they won't be accessible in the resulting joined table. Thats why by = .id in the first query doesn't work.

As noted in section 3.5.3 here http://franknarf1.github.io/r-tutorial/_book/tables.html

Beware DT[i,on=,j,by=bycols]. Just to repeat: only by=.EACHI works in a join. Typing other by= values there will cause i’s columns to become unavailable

This query helped me understand the above statement a little better:

a[b, .SD, on = .(id)]
# id t  x
# 1:  1 1 11
# 2:  1 2 12
# 3:  2 1 13

The columns from b, besides id, are not accessible in .SD for this join.

I guess that means in a join like the above, by must take either .EACHI, or a column name from the left table (a here) that is not the join variable name (as in the question above shows, id doesn't work right, even though it is in a too). Because using a column name from a seems to work correctly:

a[b, sum(x), on = .(id), by = .(t)]
   t V1
1: 1 24
2: 2 12

Upvotes: 4

Related Questions