Reputation: 7113
Lets start with some generated data which are pretty realistic:
tmp <- data.table(
label = sprintf( "X%03d", 1:500),
start = sample( 50:950, 500, replace=TRUE ),
length = round( 20 * rf( rep(1, 500), 5, 5 ), 0 )
)
DT <- tmp[ , list( t = seq( start, length.out=length ) ), by = label ]
DT[ , I := sample(1:100, 1) * dbeta( seq(from=0,to=1, length.out=length(t)), sample(3:6,1), sample(5:10,1) ), by = label ]
DT <- DT[ I > 1E-2 ]
DT
represents time series data for (in this case) 500 labels:
library(ggplot2)
ggplot( DT[ t %between% c(100,200) ], aes( x = t, y = I, group = label ) ) +
geom_line()
I want to correlate the data by all label pairs, given that they have a sufficient overlap. This is my approach:
# feel free to use just a subset here
labs <- DT[ , unique( label ) ][1:50]
# is needed for fast intersecting
setkey( DT, t )
# just needed for tracking progress
count <- 0
progress <- round(seq( from = 1, to = length(labs) * (length(labs) -1) / 2, length.out=100 ),0)
corrs <-
combn( labs, m=2, simplify=TRUE, minOverlap = 5, FUN = function( x, minOverlap ) {
# progress
count <<- count + 1
if( count %in% progress ){
cat( round( 100*count/max(progress),0 ), ".." )
}
# check overlap and correlate
a <- DT[label == x[1]]
b <- DT[label == x[2]]
iscectT <- intersect( a[ , t], b[ , t] )
n <- length(iscectT)
if( n >= minOverlap ){
R <- cor( a[J(iscectT)][, I], b[J(iscectT)][, I] )
return( c( x[1], x[2], n, min(iscectT), max(iscectT), R) )
}
else{
# only needed because of simplify = TRUE
return( rep(NA, 6) )
}
})
This works pretty fine, but is much slower than expected. In the particular case this would take up to 10 minutes on my machine.
Any help on improving the performance of this approach is highly appreciated. Questions which came to my mind:
DT
if I would deploy one of R's parallelization mechanisms, e.g. foreach
? Is there a parallelization interface for data.table
as there is for example for plyr
?combn
with simplify = FALSE
without having horrible runtimes the longer the process goes. I assume that a lot of list
copying takes place because increasing list
capacities.Upvotes: 2
Views: 408
Reputation: 7113
As Roland suggested in his comment, using combn
just to calculate the combinations of labels and then perform directly joins on the data.table
, is magnitudes faster:
corrs <- as.data.frame(do.call( rbind, combn(labs, m=2, simplify = FALSE) ), stringsAsFactors=FALSE)
names(corrs) <- c("a", "b")
setDT(corrs)
setkey(DT, label)
setkey( corrs, a )
corrs <- corrs[ DT, nomatch = 0, allow.cartesian = TRUE]
setkey(corrs, b, t)
setkey(DT, label, t)
corrs <- corrs[ DT, nomatch = 0 ]
corrs[ , overlap := .N >= minOverlap , by = list(a,b) ]
corrs <- corrs[ (overlap) ]
corrs <- corrs[ ,list( start = min(t), end = max(t), R = cor(I,I.1) ), by = list(a,b) ]
Upvotes: 3