Christian Arnold

Reputation: 168

Fast processing of large tables and table columns - apply too slow

I have been using R for a long time and I have working code for a specific little task, but I was wondering whether there is a faster solution.

The problem is simple: I have a data frame tbl with two columns, ID and NrBlocks. IDs are not unique and may appear multiple times, with the same or different NrBlocks values. The real table has more columns, but they are irrelevant here. All I want is the sum of the NrBlocks values for each unique ID.
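For concreteness, a tiny made-up example (the values are illustrative, not my real data):

tbl <- data.frame(ID = c("a", "b", "a", "c", "b"),
                  NrBlocks = c(1, 2, 3, 4, 5))
## desired result: a = 4, b = 7, c = 4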

Working code (renamed and simplified here, so I hope I did not introduce any typos):

uniqueIDs = unique(tbl$ID) #Precompute once
sapply(1:length(uniqueIDs),
       FUN = function(x){
         sum(tbl[which(tbl$ID == uniqueIDs[x]),]$NrBlocks)
       }
)

Any suggestions for a speed improvement?

Upvotes: 1

Views: 173

Answers (2)

Rich Scriven

Reputation: 99331

If you have 500 milliseconds to spare, you can do this with aggregate in one easy line.

aggregate(NrBlocks ~ ID, dat, sum)
microbenchmark(aggregate(NrBlocks ~ ID, dat, sum))
# Unit: milliseconds
#  expr     min       lq   median       uq      max neval
#   f() 655.218 655.6562 656.6428 661.6947 667.0888   100

where dat is a data frame created from the data in nrussell's answer below.

dim(dat)
# [1] 10000 2
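If you would rather skip the formula interface, base R's tapply or rowsum computes the same grouped sums - a quick sketch, not benchmarked here:

tapply(dat$NrBlocks, dat$ID, sum)  # named vector of per-ID sums
rowsum(dat$NrBlocks, dat$ID)       # one-column matrix, rows named by ID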

Upvotes: 0

nrussell

Reputation: 18602

Obligatory data.table solution -

options(stringsAsFactors=FALSE)
library(data.table)
##
set.seed(1234)
dTbl <- data.table(
  ID = sample(c(letters,LETTERS),100000,replace=TRUE),
  NrBlocks = rnorm(100000),
  key = "ID")
##
gTbl <- dTbl[
  ,
  list(sumNrBlocks = sum(NrBlocks)),
  by = list(ID)]
##
> head(gTbl)
   ID sumNrBlocks
1:  A    56.50234
2:  B   -13.61380
3:  C    24.66750
4:  D    65.18829
5:  E    26.14085
6:  F    41.64376

Timings:

library(microbenchmark)
##
uniqueIDs <- unique(dTbl$ID)
f1 <- function(){
  sapply(1:length(uniqueIDs),
         FUN = function(x){
           sum(dTbl[which(dTbl$ID == uniqueIDs[x]),]$NrBlocks)
         }
  )
}
##
f2 <- function(){
  dTbl[
    ,
    list(sumNrBlocks = sum(NrBlocks)),
    by = list(ID)]
}
##
Res <- microbenchmark(
  f1(),
  f2(),
  times=100L)
> Res
Unit: milliseconds
 expr        min         lq     median         uq        max neval
 f1() 139.054620 141.534227 144.213253 156.747569 193.278071   100
 f2()   1.813652   1.911069   1.980874   2.140971   3.522545   100

Multiple columns:

dTbl2 <- copy(dTbl)
set.seed(1234)
dTbl2[,col3:=rexp(100000)]
dTbl2[,col4:=col3*2]
##
gTbl2 <- dTbl2[
  ,
  lapply(.SD,sum),
  by=list(ID)]
##
> head(gTbl2)
   ID  NrBlocks     col3     col4
1:  A  56.50234 1933.443 3866.886
2:  B -13.61380 1904.282 3808.563
3:  C  24.66750 1834.655 3669.310
4:  D  65.18829 1884.364 3768.728
5:  E  26.14085 1874.761 3749.523
6:  F  41.64376 1977.219 3954.438

Multiple columns, restricted to a subset via .SDcols -

gTbl2.2 <- dTbl2[
  ,
  lapply(.SD,sum),
  by=list(ID),
  .SDcols=c(2,4)]
##
> head(gTbl2.2)
   ID  NrBlocks     col4
1:  A  56.50234 3866.886
2:  B -13.61380 3808.563
3:  C  24.66750 3669.310
4:  D  65.18829 3768.728
5:  E  26.14085 3749.523
6:  F  41.64376 3954.438
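Note that .SDcols also accepts a character vector of column names, which is more robust than positions if the column order ever changes - a sketch (gTbl2.3 is just an illustrative name):

gTbl2.3 <- dTbl2[
  ,
  lapply(.SD,sum),
  by=list(ID),
  .SDcols=c("NrBlocks","col4")]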

Upvotes: 1
