Reputation: 25375
I have recently started using the data.table package in R, but I recently stumbled into an issue that I do not know how to tackle with data.table.
Sample data:
set.seed(1)
library(data.table)
dt = data.table(group=c("A","A","A","B","B","B","C","C"),value = runif(8))
I can add a group count with the statement
dt[,groupcount := .N ,group]
but now I only want to keep the x groups with the largest value for groupcount
. Let's assume x=1
for the example.
I tried chaining as follows:
dt[,groupcount := .N ,group][groupcount %in% head(sort(unique(groupcount),decreasing=TRUE),1)]
But since group A and B both have three elements, they both remain in the data.table. I only want the x largest groups where x=1, so I only want one of the groups (A or B) to remain. I assume this can be done in a single line with data.table. Is this true, and if yes, how?
To clarify: x is an arbitrarily chosen number here. The function should also work with x=3, where it would return the 3 largest groups.
Upvotes: 3
Views: 115
Reputation: 38500
Here is a method that uses a join.
x <- 1
dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
group value N
1: A 0.2655087 3
2: A 0.3721239 3
3: A 0.5728534 3
The inner data.frame is aggregated to count the observations and the position of the x largest groups is retrieved using order
subset using the value of x. The resulting data frame is then joined onto the original by group.
Upvotes: 3
Reputation: 26248
How about making use of the order of the groupcount
setorder(dt, -groupcount)
x <- 1
dt[group %in% dt[ , unique(group)][1:x] ]
# group value groupcount
# 1: A 0.2655087 3
# 2: A 0.3721239 3
# 3: A 0.5728534 3
x <- 3
dt[group %in% dt[ , unique(group)][1:x] ]
# group value groupcount
# 1: A 0.2655087 3
# 2: A 0.3721239 3
# 3: A 0.5728534 3
# 4: B 0.9082078 3
# 5: B 0.2016819 3
# 6: B 0.8983897 3
# 7: C 0.9446753 2
# 8: C 0.6607978 2
## alternative syntax
# dt[group %in% unique(dt$group)[1:x] ]
Upvotes: 2
Reputation: 886938
We can do
x <- 1
dt[dt[, {tbl <- table(group)
nm <- names(tbl)[tbl==max(tbl)]
if(length(nm) < x) rep(TRUE, .N)
else group %in% sample(names(tbl)[tbl==max(tbl)], x)}]]
Upvotes: 2