Reputation: 381
If A, B and C are columns in a data.table, dt, what I would like to do is: create a new column (D), which is the minimum of each group in C, filtered so that I only consider rows in the group where B is true. So far what I have is...
dt[, D := min(A[B == TRUE]), by = C]
This works and I get the result that I'm after, but it's pretty slow (my data set is a few million rows). Is there a faster, more elegant way to do it? I'm only interested in results using data.table, no dplyr please :)
Upvotes: 0
Views: 2336
Reputation: 42544
There is an alternative approach which computes min()
for each group after filtering and then does an update join. Missing groups will be set to NA
, thereby avoiding the type conversion from integer to double.
# dummy data
dt <- data.table(A = rep(1:3, 3),
B = c(rep(c(FALSE, TRUE, TRUE), 2), rep(FALSE, 3)),
C = 10L * rep(1:3, each = 3))
dt[, A := A + C]
dt
A B C 1: 11 FALSE 10 2: 12 TRUE 10 3: 13 TRUE 10 4: 21 FALSE 20 5: 22 TRUE 20 6: 23 TRUE 20 7: 31 FALSE 30 8: 32 FALSE 30 9: 33 FALSE 30
# all variables are integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame': 9 obs. of 3 variables: $ A: int 11 12 13 21 22 23 31 32 33 $ B: logi FALSE TRUE TRUE FALSE TRUE TRUE ... $ C: int 10 10 10 20 20 20 30 30 30 - attr(*, ".internal.selfref")=<externalptr>
dt[dt[(B), min(A), by = C], on = "C", D := V1][]
A B C D 1: 11 FALSE 10 12 2: 12 TRUE 10 12 3: 13 TRUE 10 12 4: 21 FALSE 20 22 5: 22 TRUE 20 22 6: 23 TRUE 20 22 7: 31 FALSE 30 NA 8: 32 FALSE 30 NA 9: 33 FALSE 30 NA
# all variables are still integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables: $ A: int 11 12 13 21 22 23 31 32 33 $ B: logi FALSE TRUE TRUE FALSE TRUE TRUE ... $ C: int 10 10 10 20 20 20 30 30 30 $ D: int 12 12 12 22 22 22 NA NA NA - attr(*, ".internal.selfref")=<externalptr>
The expression
dt[(B), min(A), by = C]
returns the min values by group (if available)
C V1 1: 10 12 2: 20 22
NB: I do not claim that this is faster than OP's approach. I cannot test it due to lack of a reproducible example which can be scaled in problem size for benchmarking.
Upvotes: 1
Reputation: 381
I ended up converting column A to a numeric type (from integer), so that empty groups return a minimum of NA, and then used pmin.int() instead of min(). As it turns out, this is about 4 times faster than my original method!
Upvotes: 0