Rahul Premraj
Rahul Premraj

Reputation: 1595

Counting number of events underway at a timestamp

I have a series of timestamps marking the beginning and end of certain events.

library(chron)
start <- structure(c(14246.3805439815, 14246.3902662037, 14246.3909606481, 
14246.3992939815, 14246.4013773148, 14246.4034606481, 14246.4062384259, 
14246.4069328704, 14246.4069328704, 14246.4097106481, 14246.4097106481, 
14246.4104050926, 14246.4117939815, 14246.4117939815, 14246.4117939815, 
14246.4145717593, 14246.4152546296, 14246.4152662037, 14246.4152662037, 
14246.4159606481), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates", 
"times")), origin = structure(c(1, 1, 1970), .Names = c("month", 
"day", "year")), class = c("chron", "dates", "times"))

finish <- structure(c(14246.436099537, 14246.4666550926, 14246.4083217593, 
14246.4374884259, 14246.4847106481, 14246.4867939815, 14246.4305439815, 
14246.4659606481, 14246.4520717593, 14246.9097106481, 14246.4930439815, 
14246.4763773148, 14246.4326273148, 14246.4291550926, 14246.4187384259, 
14246.9145717593, 14246.4395601852, 14246.4395717593, 14246.4395717593, 
14246.4367939815), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates", 
"times")), origin = structure(c(1, 1, 1970), .Names = c("month", 
"day", "year")), class = c("chron", "dates", "times"))

events <- data.frame(start, finish)
head(event, 5)

                start              finish
1 (01/02/09 09:07:59) (01/02/09 10:27:59)
2 (01/02/09 09:21:59) (01/02/09 11:11:59)
3 (01/02/09 09:22:59) (01/02/09 09:47:59)
4 (01/02/09 09:34:59) (01/02/09 10:29:59)
5 (01/02/09 09:37:59) (01/02/09 11:37:59)

I now wish to count how many events are underway at specific timestamps.

intervals <- structure(c(14246.3958333333, 14246.40625, 14246.4166666667, 
14246.4270833333, 14246.4375), format = structure(c("m/d/y", 
"h:m:s"), .Names = c("dates", "times")), origin = structure(c(1, 
1, 1970), .Names = c("month", "day", "year")), class = c("chron", 
"dates", "times"))

intervals

[1] (01/02/09 09:30:00) (01/02/09 09:45:00) (01/02/09 10:00:00) (01/02/09 10:15:00) (01/02/09 10:30:00)

So the output I desire is as follows:

            intervals count
1 (01/01/09 09:30:00)     3
2 (01/01/09 09:45:00)     7
3 (01/01/09 10:00:00)    19
4 (01/01/09 10:15:00)    18
5 (01/01/09 10:30:00)    12

While the problem is trivial to solve programatically, I wish to accomplish this for 210,000 intervals and over 1.2 million events. My current approach involves leveraging the data.table package and the & operator to check whether an interval lies between the start and end time of each event.

library(data.table)
events <- data.table(events)
data.frame(intervals, count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))

But considering the size of my data, this approach takes a very long time to run. Any advice on better alternatives to accomplish this in R?

Cheers.

Upvotes: 5

Views: 3014

Answers (2)

Andrie
Andrie

Reputation: 179398

The secret of fast performing code in R is to keep everything in vector, or arrays, which are really just arrays in disguise.

Here is a solution that makes use exclusively of base R arrays. Your sample of data is tiny so I use replicate and system.time combined to measure performance.

My solution is roughly 6 times faster than your solution with sapply and data.table. (My solution takes 0.6 seconds to solve your small sample data set 1,000 times.)

Timing your solution

system.time(replicate(1000, 
    XX <- data.frame(
      intervals, 
      count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))
))

   user  system elapsed 
   4.04    0.05    4.11 

My solution. First create two helper functions to create equal sized arrays with events running down the columns and intervals running across the rows. Then do a simple vector comparison followed by colSums:

event.array <- function(x, interval){
  len <- length(interval)
  matrix(rep(unclass(x), len), ncol=len)
}

intervals.array <- function(x, intervals){
  len <- length(x)
  matrix(rep(unclass(intervals), len), nrow=len, byrow=TRUE)
} 


a.start <- event.array(start, intervals)
a.finish <- event.array(finish, intervals)
a.intervals <- intervals.array(start, intervals)

data.frame(intervals, 
           count=colSums(a.start <= a.intervals & a.finish >= a.intervals))

            intervals count
1 (01/02/09 09:30:00)     3
2 (01/02/09 09:45:00)     7
3 (01/02/09 10:00:00)    19
4 (01/02/09 10:15:00)    18
5 (01/02/09 10:30:00)    12

Timing my solution

system.time(replicate(1000, 
  YY <- data.frame(
          intervals, 
          count=colSums(a.start <= a.intervals & a.finish >= a.intervals))
))

   user  system elapsed 
   0.67    0.02    0.69 

all.equal(XX, YY)
[1] TRUE

Upvotes: 3

dnagirl
dnagirl

Reputation: 20456

Perhaps using dim() instead of sum() and ldply() instead of sapply() might be faster?

b<-function(i,df){ data.frame(i, count=dim(df[with(df, start<i & finish> i),])[1])};
ldply(intervals, b, events);

         i count
1 14246.40     3
2 14246.41     7
3 14246.42    19
4 14246.43    18
5 14246.44    12

I'm not familiar with the chron library so I did't make i come out as a timestamp. Sorry.

Upvotes: 0

Related Questions