astro person
astro person

Reputation: 381

data.table: Find the minimum of a subset of each group

If A, B and C are columns in a data.table, dt, what I would like to do is: create a new column (D), which is the minimum of each group in C, filtered so that I only consider rows in the group where B is true. So far what I have is...

dt[, D := min(A[B == TRUE]), by = C]

This works and I get the result that I'm after, but it's pretty slow (my data set is a few million rows). Is there a faster, more elegant way to do it? I'm only interested in results using data.table, no dplyr please :)

Upvotes: 0

Views: 2336

Answers (2)

Uwe
Uwe

Reputation: 42544

There is an alternative approach which computes min() for each group after filtering and then does an update join. Missing groups will be set to NA, thereby avoiding the type conversion from integer to double.

# dummy data
dt <- data.table(A = rep(1:3, 3),
                 B = c(rep(c(FALSE, TRUE, TRUE), 2), rep(FALSE, 3)),
                 C = 10L * rep(1:3, each = 3))
dt[, A := A + C]
dt
    A     B  C
1: 11 FALSE 10
2: 12  TRUE 10
3: 13  TRUE 10
4: 21 FALSE 20
5: 22  TRUE 20
6: 23  TRUE 20
7: 31 FALSE 30
8: 32 FALSE 30
9: 33 FALSE 30
# all variables are integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame':    9 obs. of  3 variables:
 $ A: int  11 12 13 21 22 23 31 32 33
 $ B: logi  FALSE TRUE TRUE FALSE TRUE TRUE ...
 $ C: int  10 10 10 20 20 20 30 30 30
 - attr(*, ".internal.selfref")=<externalptr>
dt[dt[(B), min(A), by = C], on = "C", D := V1][]
    A     B  C  D
1: 11 FALSE 10 12
2: 12  TRUE 10 12
3: 13  TRUE 10 12
4: 21 FALSE 20 22
5: 22  TRUE 20 22
6: 23  TRUE 20 22
7: 31 FALSE 30 NA
8: 32 FALSE 30 NA
9: 33 FALSE 30 NA
# all variables are still integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame':    9 obs. of  4 variables:
 $ A: int  11 12 13 21 22 23 31 32 33
 $ B: logi  FALSE TRUE TRUE FALSE TRUE TRUE ...
 $ C: int  10 10 10 20 20 20 30 30 30
 $ D: int  12 12 12 22 22 22 NA NA NA
 - attr(*, ".internal.selfref")=<externalptr>

The expression

dt[(B), min(A), by = C]

returns the min values by group (if available)

    C V1
1: 10 12
2: 20 22

NB: I do not claim that this is faster than OP's approach. I cannot test it due to lack of a reproducible example which can be scaled in problem size for benchmarking.

Upvotes: 1

astro person
astro person

Reputation: 381

I ended up converting column A to a numeric type (from integer), so that empty groups return a minimum of NA, and then used pmin.int() instead of min(). As it turns out, this is about 4 times faster than my original method!

Upvotes: 0

Related Questions