Reputation: 742
I'm grouping a data.table and want to select from each group the first row where x == 1 or, if such a row does not exist, then the first row with any value in x
d <- data.table(
a = c(1,1,1, 2,2, 3,3),
x = c(0,1,0, 0,0, 1,1),
y = c(1,2,3, 1,2, 1,2)
)
this attempt
d[, ifelse(any(.SD[,x] == 1),.SD[x == 1][1], .SD[1]), by = a]
returns
a V1
1: 1 1
2: 2 0
3: 3 1
but i expected
a x y
1: 1 1 2
2: 2 0 1
3: 3 1 1
Any ideas how to get it right?
Upvotes: 17
Views: 1251
Reputation: 887058
We can also do this with .I
to return the row index and use that for subsetting the rows.
d[d[, .I[which.max(x==1)], by = a]$V1]
# a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1
In the current versions of data.table
, .I
approach is more efficient compared to the .SD
for subsetting rows (However, it could change in the future). This is also a similar post
Here is another option with order
(setkey
can also be used - for efficiency) the dataset by 'a' and 'x' after grouping by 'a', and then get the first row with head
d[order(a ,-x), head(.SD, 1) ,by = a]
# a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1
Initially, we were thinking about benchmarking on > 1e6, but the .SD
methods are taking time, so comparing on 3e5
rows using data.table_1.9.7
set.seed(24)
d1 <- data.table(a = rep(1:1e5, 3), x = sample(0:1, 1e5*3,
replace=TRUE), y = rnorm(1e5*3))
system.time(d1[, .SD[which.max(x == 1)], by = a])
# user system elapsed
# 56.21 30.64 86.42
system.time(d1[, .SD[match(1L, x, nomatch = 1L)], by = a])
# user system elapsed
# 55.27 30.07 83.75
system.time(d1[d1[, .I[which.max(x==1)], by = a]$V1])
# user system elapsed
# 0.19 0.00 0.19
system.time(d1[order(a ,-x), head(.SD, 1) ,by = a])
# user system elapsed
# 0.03 0.00 0.04
Upvotes: 6
Reputation: 49448
Another option (which.max
is basically designed to do exactly what you want):
d[, .SD[which.max(x == 1)], by = a]
# a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1
Upvotes: 15
Reputation: 92282
I think it's a good use case for both match
and it's nomatch
argument
d[, .SD[match(1L, x, nomatch = 1L)], by = a]
# a x y
# 1: 1 1 2
# 2: 2 0 1
# 3: 3 1 1
This is basically, in case of no-match, returns 1
, and as a result gives you the first row in the group. If there is a multiple match, then it will return the first one- as per your desire
Upvotes: 15