Han Xu
Han Xu

Reputation: 185

Aggregating data table in R with information outside the table

I have a data table like as below, and I would like to aggregate all the values within a 5s time range of another vector of timestamp.

If my description confuses you, please see the following toy example

trade_hist <-data.table(timestamp=seq(1,200,2),value=rep(1:5,20))
TS <- seq(2,200,2)
> head(trade_hist)
   timestamp value
1:         1     1
2:         3     2
3:         5     3
4:         7     4
5:         9     5
6:        11     1
> head(TS)
[1]  2  4  6  8 10 12 

I eventually want to create new table with timetamp as in TS and aggregate the values in original table with certain condition (say sum up all values within 5s time range prior to that time stamp). I can accomplish this with the following code:

list_of_values <- vector("list",length(TS))
for(i in 1:length(TS)){
        ts <- TS[i]
        tmp <- trade_hist[(ts-timestamp <= 5) & ts-timestamp > 0]        
        tmp <- tmp[,.(sumVal=sum(value))]
        list_of_values[[i]] <- tmp
}

newcol <- rbindlist(list_of_values)
result_wanted <- cbind(TS,newcol)

> head(result_wanted)
   TS sumVal
1:  2      1
2:  4      3
3:  6      6
4:  8      9
5: 10     12
6: 12     10

For example, when TS=6, look back to the original table for 5s range prior to that we have value=1,2,3 thereby the sum is 6; when TS=8, look back to the original table we have value=2,3,4, sum is 9.

However the code above is extremely inefficient for bigger table, is there some other way around? Thx!

Upvotes: 2

Views: 112

Answers (2)

Han Xu
Han Xu

Reputation: 185

Thanks @MichaelChirico for suggesting forverlap I wrote a solution for my question using forverlap and it is really quick!

x <- data.table(start=TS,end=TS)
y <- trade_hist[,.(start=timestamp,end=timestamp+5,value)]
setkey(y, start, end)
tmp <- foverlaps(x, y, type="within")
result <- tmp[,.(value=sum(value)),by=i.start]
setnames(result,"i.start","timestamp")

Upvotes: 2

alistaire
alistaire

Reputation: 43334

With dplyr,

sumVal <- sapply(TS, function(x){
    sum(trade_hist %>% 
        filter(timestamp >= x-5, timestamp <= x) %>% 
        select(value))
})
result_wanted <- data.frame(TS, sumVal)

which will be considerably faster than the original. data.table or Rcpp could probably make it faster yet, if your data is truly huge.

Keep in mind that if TS and trade_hist$timestamp are actually date-time objects, they need to be in a format that can handle subtraction and greater than/less than operators, or those operations need to be changed to a suitable one. lubridate or chron probably has what you need if the base POSIXlt/POSIXct isn't cutting it.

Upvotes: 1

Related Questions