Performance: combn on large data.table

Question

Lets start with some generated data which are pretty realistic:

tmp <- data.table(
  label  = sprintf( "X%03d", 1:500),
  start  = sample( 50:950, 500, replace=TRUE ),
  length = round( 20 *  rf( rep(1, 500), 5, 5 ), 0 )
)
DT <- tmp[ , list( t = seq( start, length.out=length ) ), by = label ]
DT[ , I := sample(1:100, 1) * dbeta( seq(from=0,to=1, length.out=length(t)), sample(3:6,1), sample(5:10,1) ), by = label ]
DT <- DT[ I > 1E-2 ]

DT represents time series data for (in this case) 500 labels:

library(ggplot2)
ggplot( DT[ t %between% c(100,200) ], aes( x = t, y = I, group = label ) ) +
  geom_line()

enter image description here

I want to correlate the data by all label pairs, given that they have a sufficient overlap. This is my approach:

# feel free to use just a subset here
labs <- DT[ , unique( label )  ][1:50]
# is needed for fast intersecting
setkey( DT, t )
# just needed for tracking progress    
count <- 0
progress <- round(seq( from = 1, to = length(labs) * (length(labs) -1) / 2, length.out=100 ),0)

corrs <- 
  combn( labs, m=2, simplify=TRUE, minOverlap = 5, FUN = function( x, minOverlap ) {

    # progress
    count <<- count + 1
    if( count %in% progress ){
      cat( round( 100*count/max(progress),0 ), ".." )
    }

    # check overlap and correlate
    a <- DT[label == x[1]]
    b <- DT[label == x[2]]        
    iscectT <- intersect( a[ , t], b[ , t] )
    n  <- length(iscectT)
    if( n >= minOverlap ){
      R <- cor( a[J(iscectT)][, I], b[J(iscectT)][, I] )
      return( c( x[1], x[2], n, min(iscectT), max(iscectT), R) )
    }
    else{
      # only needed because of simplify = TRUE
      return( rep(NA, 6)  )
    }
  })

This works pretty fine, but is much slower than expected. In the particular case this would take up to 10 minutes on my machine.

Any help on improving the performance of this approach is highly appreciated. Questions which came to my mind:

Do I have to expect any side effects concerning on DTif I would deploy one of R's parallelization mechanisms, e.g. foreach? Is there a parallelization interface for data.table as there is for example for plyr?
Is there a way of using combn with simplify = FALSE without having horrible runtimes the longer the process goes. I assume that a lot of list copying takes place because increasing list capacities.
Is there anything I can do on the algorithmic side to make this faster?

Beasterfield · Accepted Answer

As Roland suggested in his comment, using combn just to calculate the combinations of labels and then perform directly joins on the data.table, is magnitudes faster:

corrs <- as.data.frame(do.call( rbind, combn(labs, m=2, simplify = FALSE) ), stringsAsFactors=FALSE)
names(corrs) <- c("a", "b")
setDT(corrs)

setkey(DT, label)
setkey( corrs, a )

corrs <- corrs[ DT, nomatch = 0, allow.cartesian = TRUE]
setkey(corrs, b, t)
setkey(DT, label, t)

corrs <- corrs[ DT, nomatch = 0 ]
corrs[ , overlap := .N >= minOverlap , by = list(a,b) ]
corrs <- corrs[ (overlap) ]
corrs <- corrs[ ,list( start = min(t), end = max(t), R = cor(I,I.1) ), by = list(a,b) ]

Performance: combn on large data.table

Answers (1)

Related Questions