user62117

Reputation: 35

R data.table: calculating grouped frequencies

I'm trying to add columns to my data.table that essentially append a per-group cumulative frequency table. Unfortunately, my current solution is about ten times slower than I had hoped.

Here is what I'm using (apologies for the ugly one-liner):

DT[, c("bin1", "bin2", "bin3", "bin4") := as.list(cumsum(hist(colx, c(lbound, bound1, bound2, bound3, ubound), plot=FALSE)$counts)), by=category]

If the bin boundaries are set at 0,25,50,75,100, I would like my table to look like:

id category colx bin1 bin2 bin3 bin4
1  a        5    1    2    2    3
2  a        30   1    2    2    3
3  b        21   1    2    3    4
4  c        62   0    1    3    3
5  b        36   1    2    3    4
6  a        92   1    2    2    3
7  c        60   0    1    3    3
8  b        79   1    2    3    4
9  b        54   1    2    3    4
10 c        27   0    1    3    3

In the actual dataset I'm grouping by 4 different columns, and there are millions of rows and unique groups. When I try a simpler function, such as sum, the calculation takes an acceptable amount of time. Is there any way to significantly speed up the counting process?
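
For reference, the sample data in the table above can be reconstructed with something like this (a sketch only; it assumes the boundary variables lbound, bound1, bound2, bound3, ubound from my code map to 0, 25, 50, 75, 100 as described):

library(data.table)

DT <- data.table(
    id       = 1:10,
    category = c("a", "a", "b", "c", "b", "a", "c", "b", "b", "c"),
    colx     = c(5, 30, 21, 62, 36, 92, 60, 79, 54, 27)
)

# bin boundaries: 0, 25, 50, 75, 100
lbound <- 0; bound1 <- 25; bound2 <- 50; bound3 <- 75; ubound <- 100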

Upvotes: 2

Views: 1173

Answers (1)

Arun

Reputation: 118799

Okay, here's one way (I'm using data.table v1.9.3 here). Remove the by=.EACHI if you're on version 1.9.2 or earlier.

# 1. assign each value to a bin (1-4) using the boundaries 0, 25, 50, 75, 100
dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
# 2. key by group and bin, then count rows for every group/bin combination
setkey(dt, category, ival)
ans <- dt[CJ(unique(category), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
# 3. turn the counts into cumulative counts and reshape to one column per bin
ans[, N := cumsum(N), by="category"][, bin := "bin"]
ans <- dcast.data.table(ans, category ~ bin+ival, value.var="N")
# 4. join the per-group bin columns back onto the original rows
ans <- dt[ans][, ival := NULL]

    id category colx bin_1 bin_2 bin_3 bin_4
 1:  1        a    5     1     2     2     3
 2:  2        a   30     1     2     2     3
 3:  6        a   92     1     2     2     3
 4:  3        b   21     1     2     3     4
 5:  5        b   36     1     2     3     4
 6:  9        b   54     1     2     3     4
 7:  8        b   79     1     2     3     4
 8: 10        c   27     0     1     3     3
 9:  4        c   62     0     1     3     3
10:  7        c   60     0     1     3     3
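
In case the findInterval step isn't obvious: it returns, for each value, the index of the interval it falls into, and rightmost.closed=TRUE makes values equal to the upper boundary land in the last bin rather than a fifth one. A quick illustration with values from your example (the 100 is just there to show the boundary case):

findInterval(c(5, 30, 62, 92, 100), seq(0, 100, by=25), rightmost.closed=TRUE)
# [1] 1 2 3 4 4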

Benchmark on simulated large data:

Here I generate a data.table with 20 million rows and a total of 1 million groups, using 2 grouping columns (instead of the 4 you mention in your question).

K = 1e3L
N = 20e6L
sim_data <- function(K, N) {
    set.seed(1L)
    ff <- function(K, N) sample(paste0("V", 1:K), N, TRUE)
    data.table(x=ff(K,N), y=ff(K,N), val=sample(1:100, N, TRUE))
}

dt <- sim_data(K, N)
method1 <- function(x) { 
    dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
    setkey(dt, x, y, ival)
    ans <- dt[CJ(unique(x), unique(y), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
    ans[, N := cumsum(N), by="x,y"][, bin := "bin"]
    ans <- dcast.data.table(ans, x+y ~ bin+ival, value.var="N")
    ans <- dt[ans][, ival := NULL]
}

system.time(ans1 <- method1(dt))
#   user  system elapsed 
# 13.148   2.778  16.209 

I hope this is faster than your original solution and scales well for your real data dimensions.
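
The same pattern should carry over to your four grouping columns; a sketch, with g1, g2, g3, g4 and val as placeholder names for your real columns:

dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, g1, g2, g3, g4, ival)
ans <- dt[CJ(unique(g1), unique(g2), unique(g3), unique(g4), unique(ival)),
          .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="g1,g2,g3,g4"][, bin := "bin"]
ans <- dcast.data.table(ans, g1+g2+g3+g4 ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]

One thing to keep an eye on: CJ builds the full cross product of the unique values of all four columns, so if only a small fraction of those combinations actually occur in your data, the intermediate result can get quite large.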


Update: Here's another version that uses data.table's rolling joins instead of base R's findInterval. We have to shift the interval boundaries slightly so that the rolling join finds the right match.

dt <- sim_data(K, N)
method2 <- function(x) {
    # boundaries shifted to 24, 49, 74, 100: since val is integer, the last row
    # with val <= 24 (or 49, 74, 100) gives the cumulative count for each bin
    ivals = seq(24L, 100L, by=25L)
    ivals[length(ivals)] = 100L
    setkey(dt, x,y,val)
    # ival = each row's rank within its group when sorted by val
    dt[, ival := seq_len(.N), by="x,y"]
    # rolling join: for each group and boundary, take the last row at or below it;
    # groups with nothing below a boundary get NA, which becomes a count of 0
    ans <- dt[CJ(unique(x), unique(y), ivals), roll=TRUE, mult="last"][is.na(ival), ival := 0L][, bin := "bin"]
    ans <- dcast.data.table(ans, x+y~bin+val, value.var="ival")
    dt[, ival := NULL]
    ans2 <- dt[ans]
}

system.time(ans2 <- method2(dt))
#   user  system elapsed 
# 12.538   2.649  16.079 
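
If the shifted boundaries look odd, here's a toy illustration (using the values of group "a" from your example) of what the rolling join does: for each boundary it grabs the last row at or below it, and that row's within-group rank is exactly the cumulative count for the bin.

toy <- data.table(val = c(5, 30, 92))   # group "a" from the example, already sorted
setkey(toy, val)
toy[, rank := seq_len(.N)]
toy[.(c(24, 49, 74, 100)), roll=TRUE, mult="last"]
#    val rank
# 1:  24    1
# 2:  49    2
# 3:  74    2
# 4: 100    3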

## check if both methods give identical results:

setkey(ans1, x,y,val)
setnames(ans2, copy(names(ans1)))
setkey(ans2, x,y,val)

identical(ans1, ans2) # [1] TRUE

Edit: Some explanation of why the OP's approach is so time consuming:

A big reason for the runtime difference, I suspect, is that both answers here are vectorised: findInterval and the rolling join are implemented in C and operate on the whole data set at once. hist, by contrast, is an S3 generic, so there is dispatch overhead to reach hist.default, and the function itself is written in R. Your version therefore runs hist, an R-level function, roughly a million times (once per group), whereas the two solutions above do the binning with a single call into C and don't need to call anything per group.

And since that binning is the most complex part of the computation, it dominates the runtime.
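
To make that concrete, here's a rough sketch of the two call patterns (not part of the benchmark above, and the timings will depend on your machine):

set.seed(1L)
dt2 <- data.table(g   = sample(paste0("V", 1:1e3L), 2e6L, TRUE),
                  val = sample(1:100, 2e6L, TRUE))
breaks <- seq(0, 100, by=25)

# OP's pattern: hist() is dispatched and evaluated in R once per group
system.time(dt2[, as.list(cumsum(hist(val, breaks, plot=FALSE)$counts)), by=g])

# vectorised pattern: a single findInterval() call in C over the whole column
system.time(dt2[, ival := findInterval(val, breaks, rightmost.closed=TRUE)])

(The two don't treat boundary values identically, since hist bins are right-closed by default, but the point here is the call pattern, not the binning convention.)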

Upvotes: 1
