RSA
RSA

Reputation: 33

R data.table summary not consistent between explicit sum and .SD

in R data table summarizing producing inconsistent results

DT = data.table(
  A = rep(1:3, each = 5L), 
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT[, .(suma = sum(A), sumb = sum(B), sumc=sum(C), sumd= sum(D)), by=A]

Consciously summarizing a grouping variable is added. However this produces

# A data.table: 3 × 5 
A   suma    sumb    sumc    sumd
<int>   <int>   <int>   <int>   <int>
1   1   15  40  36
2   2   15  39  38
3   3   15  41  46

which is not correct. However modifying this like

DT[, lapply(.SD, sum), by=A, .SDcols=c('A' ,'B','C','D')]

give correct results

A data.table: 3 × 5 
A   A   B   C   D
<int>   <int>   <int>   <int>   <int>
1   5   15  40  36
2   10  15  39  38
3   15  15  41  46

Is this inconsistency expected or any reason why they are giving different results?

Upvotes: 3

Views: 121

Answers (1)

SamR
SamR

Reputation: 20494

I can reproduce this with data.table 1.15.41 and I agree the output is far from intuitive.

What seems to be happening is this: when you perform an operation like DT[, .(suma = sum(A)), by = A], data.table treats A in the j expression as the grouping key rather than as a vector of values in the A column. This means sum(A) operates on the group identifier - a scalar - not the actual data within each group.

We can see this if we try to take the length() of the values.

DT[, .(len_a = length(A), len_b = length(B)), A]
#        A len_a len_b
#    <int> <int> <int>
# 1:     1     1     5
# 2:     2     1     5
# 3:     3     1     5

Contrast this with using .SD:

DT[, lapply(.SD, length), A, .SDcols = c("A", "B")]
#        A     A     B
#    <int> <int> <int>
# 1:     1     5     5
# 2:     2     5     5
# 3:     3     5     5

Verbose output provides a little more insight:

DT[, .(len_a = length(A), len_b = length(B)), A, verbose = TRUE]
# Detected that j uses these columns: [B]
# Finding groups using forderv ... forder.c received 15 rows and 1 columns
# 0.000s elapsed (0.000s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(length(A), length(B))'

#        A len_a len_b
#    <int> <int> <int>
# 1:     1     1     5
# 2:     2     1     5
# 3:     3     1     5

The output shows that data.table skips using the A column as a vector:

Detected that j uses these columns: [B]

Yet it also claims:

j unchanged as 'list(length(A), length(B))'

This is a little misleading as the A here is the group key instead of the A column.

I skimmed the Aggregations section of the data.table docs and I couldn't find anything alluding to this behaviour. However, @Gusbourne has pointed out in a comment that this is mentioned in the data.table FAQ 2.10, Inside each group, why are the group variables length-1? The answer is: for efficiency and convenience.

I think it's quite unusual to perform numeric operations on a grouping column. Nevertheless, the output is not what I would expect and to me this seems like a bug to me - at least in the prominence of this in documentation, if not in implementation. If one did not already exist, it might be worth submitting a GitHub issue. However, @Gusbourne has drawn my attention to a related issue from 2018. This seems to be considered a feature and the issue has been closed so I think the best thing to do is just avoid the undesired behaviour using the method in your question, by using .SD.

Upvotes: 5

Related Questions