Beasterfield
Beasterfield

Reputation: 7113

Performance: combn on large data.table

Lets start with some generated data which are pretty realistic:

tmp <- data.table(
  label  = sprintf( "X%03d", 1:500),
  start  = sample( 50:950, 500, replace=TRUE ),
  length = round( 20 *  rf( rep(1, 500), 5, 5 ), 0 )
)
DT <- tmp[ , list( t = seq( start, length.out=length ) ), by = label ]
DT[ , I := sample(1:100, 1) * dbeta( seq(from=0,to=1, length.out=length(t)), sample(3:6,1), sample(5:10,1) ), by = label ]
DT <- DT[ I > 1E-2 ]

DT represents time series data for (in this case) 500 labels:

library(ggplot2)
ggplot( DT[ t %between% c(100,200) ], aes( x = t, y = I, group = label ) ) +
  geom_line()

enter image description here

I want to correlate the data by all label pairs, given that they have a sufficient overlap. This is my approach:

# feel free to use just a subset here
labs <- DT[ , unique( label )  ][1:50]
# is needed for fast intersecting
setkey( DT, t )
# just needed for tracking progress    
count <- 0
progress <- round(seq( from = 1, to = length(labs) * (length(labs) -1) / 2, length.out=100 ),0)

corrs <- 
  combn( labs, m=2, simplify=TRUE, minOverlap = 5, FUN = function( x, minOverlap ) {

    # progress
    count <<- count + 1
    if( count %in% progress ){
      cat( round( 100*count/max(progress),0 ), ".." )
    }

    # check overlap and correlate
    a <- DT[label == x[1]]
    b <- DT[label == x[2]]        
    iscectT <- intersect( a[ , t], b[ , t] )
    n  <- length(iscectT)
    if( n >= minOverlap ){
      R <- cor( a[J(iscectT)][, I], b[J(iscectT)][, I] )
      return( c( x[1], x[2], n, min(iscectT), max(iscectT), R) )
    }
    else{
      # only needed because of simplify = TRUE
      return( rep(NA, 6)  )
    }
  })

This works pretty fine, but is much slower than expected. In the particular case this would take up to 10 minutes on my machine.

Any help on improving the performance of this approach is highly appreciated. Questions which came to my mind:

Upvotes: 2

Views: 408

Answers (1)

Beasterfield
Beasterfield

Reputation: 7113

As Roland suggested in his comment, using combn just to calculate the combinations of labels and then perform directly joins on the data.table, is magnitudes faster:

corrs <- as.data.frame(do.call( rbind, combn(labs, m=2, simplify = FALSE) ), stringsAsFactors=FALSE)
names(corrs) <- c("a", "b")
setDT(corrs)

setkey(DT, label)
setkey( corrs, a )

corrs <- corrs[ DT, nomatch = 0, allow.cartesian = TRUE]
setkey(corrs, b, t)
setkey(DT, label, t)

corrs <- corrs[ DT, nomatch = 0 ]
corrs[ , overlap := .N >= minOverlap , by = list(a,b) ]
corrs <- corrs[ (overlap) ]
corrs <- corrs[ ,list( start = min(t), end = max(t), R = cor(I,I.1) ), by = list(a,b) ]  

Upvotes: 3

Related Questions