Reputation: 1042
I'm using the data.table package to aggregate a column that is also the grouping column, but the results are not what I expected.
library(data.table)
my_data = data.table(contnt = c("america", "asia", "asia", "europe", "europe", "europe"), num = 1:6)
#my_data
#contnt num
#america 1
#asia 2
#asia 3
#europe 4
#europe 5
#europe 6
my_data[, length(contnt),by=contnt]
#contnt V1
#america 1
#asia 1
#europe 1
It works differently when I aggregate a column other than the grouping column:
my_data[, length(num),by=contnt]
#contnt V1
#america 1
#asia 2
#europe 3
What causes this discrepancy?
Upvotes: 2
Views: 126
Reputation: 67778
Please study the data.table FAQ:
Inside each group, why are the group variables length-1?
[...] x is a grouping variable and (as from v1.6.1) has length 1 (if inspected or used in j). It's for efficiency and convenience. [...] If you need the size of the current group, use .N rather than calling length() on any column.
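As a minimal sketch of the FAQ's suggestion, assuming the same my_data table as in the question (the name counts is only illustrative), the group size can be read from .N instead of calling length() on a column:

```r
library(data.table)

# Same table as in the question
my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num = 1:6
)

# .N is the number of rows in the current group, so it does not
# depend on whether a column is a grouping variable or not
counts <- my_data[, .N, by = contnt]
print(counts)
```

The result has one row per value of contnt, with the count in a column named N.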
Upvotes: 3
Reputation: 1344
This is a great example to demonstrate the way data.table passes grouping variables vs. other variables to functions:
my_data[,print(contnt),by=contnt]
# [1] "america"
# [1] "asia"
# [1] "europe"
my_data[,print(num),by=contnt]
# [1] 1
# [1] 2 3
# [1] 4 5 6
Essentially, grouping variables are passed as vectors of length 1 for each group, whereas for other variables, the entire vector for each group is passed.
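The same difference can be seen without print() by comparing lengths side by side. This is only a sketch, assuming the my_data table from the question; the column names group_col_len, other_col_len and full_len are made up for illustration. If the grouping value really is needed once per row inside j, repeating it with rep(contnt, .N) is one way to rebuild the full-length vector:

```r
library(data.table)

# Same table as in the question
my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num = 1:6
)

lens <- my_data[, .(
  group_col_len = length(contnt),          # grouping variable: length 1 in j
  other_col_len = length(num),             # other column: one element per row
  full_len      = length(rep(contnt, .N))  # grouping value repeated per row
), by = contnt]
print(lens)
```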
Upvotes: 6