Reputation: 20329
What is the recommended / "best" way (with respect to performance) to filter a data.table based on some criterion calculated on an aggregated form of this very table? A reprex speaks more than 1000 words:
library(data.table)
DT <- data.table(grp = rep(LETTERS[1:3], each = 3), x = 1:9)
setkey(DT, "grp")
DT[DT[, .(nok = any(x == 4)), by = grp][nok == FALSE]]
DT[DT[, .GRP[all(x != 4)], by = grp]]
I could think of these two solutions immediately, and my gut feeling tells me the second form should be "better" (smaller intermediate tables are stored and I do not need to chain the results), but I was wondering whether there is a canonical way of doing this.
Maybe I do not need to use a join in the first place and can use a grouped filter for the i argument? This obviously does not work as intended (by apparently only has an effect on j):
DT[all(x != 4), by = grp]
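A .SD-based grouped filter does work, although I suspect that materialising each group makes it slower than the join forms above:
# grouped filter via .SD: keep whole groups in which no x equals 4
DT[, .SD[all(x != 4)], by = grp]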
While this SO answer shows yet another way of doing the same, my main concern is performance. Thus, I would like to know which of these options will generally scale well to large tables if I want to work further on the filtered data.table (that is, use another j expression on the filtered result).
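To make "further work" concrete, a follow-up on the filtered result could look like this (sum(x) is only a placeholder for the real downstream computation):
# second form from above, followed by another grouped j expression
DT[DT[, .GRP[all(x != 4)], by = grp]][, .(xsum = sum(x)), by = grp]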
In my real case scenario, I have about 16 million rows, with about 40k unique keys and 14 columns.
Thus a benchmark data set could look as follows:
bench <- data.table(keys = rep(paste0("k", 1:40000), 400))
bench[, paste0("cols", 1:13) := replicate(13, sample(40000 * 400, TRUE),
                                          simplify = FALSE)]
While I am looking for a generic answer (if possible) irrespective of the final filter selected, the actual filter will be to find out which groups contain any NA
value.
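Translated to the benchmark data, the filter would thus take roughly this shape (cols1 stands in for whichever column is checked; note that bench as constructed above contains no NA values, so this only illustrates the form of the query):
# keep all groups whose cols1 contains no NA value
bench[bench[, .GRP[!anyNA(cols1)], by = keys], on = "keys"]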
Upvotes: 3
Views: 535
Reputation: 25225
I do not think there is a generic approach that applies to all situations; it depends on the characteristics of the dataset and on the filtering criteria. The dataset might have many small groups or integer keys, while the filtering might lend itself to a faster low-level implementation.
Here are a few options with regard to your actual problem (i.e. filtering for groups with NAs in one of the columns):
DT_keys <- copy(DT)
system.time(setkey(DT_keys, keys))
# user system elapsed
# 1.50 0.67 1.32
DT_cols1 <- copy(DT)
system.time(setkey(DT_cols1, cols1))
# user system elapsed
# 4.21 0.21 1.30
microbenchmark::microbenchmark(times = 1L,
    # m0: per group, the key repeated for every row where cols1 is NA
    m0 = DT_keys[, keys[is.na(cols1)], keys]$keys,
    # m1: per group, the key repeated for every row of any group that contains an NA in cols1
    m1 = DT_keys[, if (anyNA(cols1)) keys, keys]$keys,
    # m2: keyed binary search for the rows where cols1 is NA, then extract their keys
    m2 = DT_cols1[.(NA_integer_)]$keys
)
Timings for the 16 million rows of dummy data:
Unit: milliseconds
expr min lq mean median uq max neval
m0 90.675005 90.675005 90.675005 90.675005 90.675005 90.675005 1
m1 56.548620 56.548620 56.548620 56.548620 56.548620 56.548620 1
m2 4.010301 4.010301 4.010301 4.010301 4.010301 4.010301 1
The timings are very fast for the size of your actual dataset; there is not much time to be saved unless you run the filtering hundreds of times. You might also want to account for the time spent setting the key (shown above), not just the filtering runtime.
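If you need the filtered table rather than just the offending keys, one possible follow-up (a sketch, not benchmarked here) is an anti-join on those keys:
# keys of all groups that contain an NA in cols1, then drop those groups
bad_keys <- DT_cols1[.(NA_integer_), unique(keys)]
filtered <- DT_keys[!.(bad_keys)]  # anti-join on the key column "keys"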
data:
library(data.table)
set.seed(0L)
nk <- 4e4L
nn <- 400L
DT <- data.table(keys = rep(paste0("k", 1L:nk), nn))
DT[, paste0("cols", 1L:13L) :=
replicate(13L, sample(c(NA_integer_, 1L:nk), nk * nn, TRUE), simplify = FALSE)]
Upvotes: 1
Reputation: 1688
I learned this from this post. You can do it like this:
DT[DT[, .I[all(x != 4)], by = grp]$V1]
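The same idiom should carry over to the NA filter from the question, e.g. on the bench data (cols1 standing in for the column being checked):
# keep every group of bench whose cols1 contains no NA
bench[bench[, .I[!anyNA(cols1)], by = keys]$V1]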
Upvotes: 2