Ash Reddy

Reputation: 1042

R data.table: inconsistent result when aggregating the grouping column

I'm using the data.table package to aggregate a column that is also the grouping column, but the results are not what I expected.

library(data.table)

my_data = data.table(contnt = c("america", "asia", "asia", "europe", "europe", "europe"), num = 1:6)

#my_data
#contnt  num
#america  1
#asia     2
#asia     3
#europe   4
#europe   5
#europe   6

my_data[, length(contnt),by=contnt]
#contnt  V1
#america  1
#asia     1
#europe   1

It works differently when I aggregate a column other than the grouping column:

my_data[, length(num),by=contnt]
#contnt  V1
#america  1
#asia     2
#europe   3

What causes this discrepancy?

Upvotes: 2

Views: 126

Answers (2)

Henrik

Reputation: 67778

Please study the data.table FAQ:

Inside each group, why are the group variables length-1?

[...] x is a grouping variable and (as from v1.6.1) has length 1 (if inspected or used in j). It's for efficiency and convenience. [...]

If you need the size of the current group, use .N rather than calling length() on any column.
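For the data in the question, a minimal sketch of the .N approach could look like this (the name counts is only for illustration; when .N is used alone in j, data.table names the result column N):

```r
library(data.table)

my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num = 1:6
)

# .N is the number of rows in the current group,
# so it does not depend on which column you inspect
counts <- my_data[, .N, by = contnt]
print(counts)
#     contnt N
# 1: america 1
# 2:    asia 2
# 3:  europe 3
```

This gives the per-group row counts that length(contnt) was expected to return.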

Upvotes: 3

shrgm

Reputation: 1344

This is a great example to demonstrate the way data.table passes grouping variables vs. other variables to functions:

my_data[,print(contnt),by=contnt]
# [1] "america"
# [1] "asia"
# [1] "europe"

my_data[,print(num),by=contnt]
# [1] 1
# [1] 2 3
# [1] 4 5 6

Essentially, grouping variables are passed as vectors of length 1 for each group, whereas for other variables, the entire vector for each group is passed.
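To see both behaviours side by side, here is a small sketch on the same data. The column names len_group_col, len_other_col and n_rows are made up for illustration, and rep(contnt, .N) is just one way to get the grouping value back at full group length if you need it:

```r
library(data.table)

my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num = 1:6
)

# Compare the length of the grouping column, a non-grouping column,
# and the group size .N inside each group
lens <- my_data[, .(
  len_group_col = length(contnt),  # grouping variable: always length 1
  len_other_col = length(num),     # other column: full group vector
  n_rows        = .N               # number of rows in the group
), by = contnt]
print(lens)

# Expand the length-1 grouping value back to the group size
expanded <- my_data[, .(full = rep(contnt, .N)), by = contnt]
```

Here len_group_col is 1 for every group, while len_other_col and n_rows are 1, 2 and 3.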

Upvotes: 6
