Reputation: 1595
I have a series of timestamps marking the beginning and end of certain events.
library(chron)
start <- structure(c(14246.3805439815, 14246.3902662037, 14246.3909606481,
14246.3992939815, 14246.4013773148, 14246.4034606481, 14246.4062384259,
14246.4069328704, 14246.4069328704, 14246.4097106481, 14246.4097106481,
14246.4104050926, 14246.4117939815, 14246.4117939815, 14246.4117939815,
14246.4145717593, 14246.4152546296, 14246.4152662037, 14246.4152662037,
14246.4159606481), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates",
"times")), origin = structure(c(1, 1, 1970), .Names = c("month",
"day", "year")), class = c("chron", "dates", "times"))
finish <- structure(c(14246.436099537, 14246.4666550926, 14246.4083217593,
14246.4374884259, 14246.4847106481, 14246.4867939815, 14246.4305439815,
14246.4659606481, 14246.4520717593, 14246.9097106481, 14246.4930439815,
14246.4763773148, 14246.4326273148, 14246.4291550926, 14246.4187384259,
14246.9145717593, 14246.4395601852, 14246.4395717593, 14246.4395717593,
14246.4367939815), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates",
"times")), origin = structure(c(1, 1, 1970), .Names = c("month",
"day", "year")), class = c("chron", "dates", "times"))
events <- data.frame(start, finish)
head(event, 5)
start finish
1 (01/02/09 09:07:59) (01/02/09 10:27:59)
2 (01/02/09 09:21:59) (01/02/09 11:11:59)
3 (01/02/09 09:22:59) (01/02/09 09:47:59)
4 (01/02/09 09:34:59) (01/02/09 10:29:59)
5 (01/02/09 09:37:59) (01/02/09 11:37:59)
I now wish to count how many events are underway at specific timestamps.
intervals <- structure(c(14246.3958333333, 14246.40625, 14246.4166666667,
14246.4270833333, 14246.4375), format = structure(c("m/d/y",
"h:m:s"), .Names = c("dates", "times")), origin = structure(c(1,
1, 1970), .Names = c("month", "day", "year")), class = c("chron",
"dates", "times"))
intervals
[1] (01/02/09 09:30:00) (01/02/09 09:45:00) (01/02/09 10:00:00) (01/02/09 10:15:00) (01/02/09 10:30:00)
So the output I desire is as follows:
intervals count
1 (01/01/09 09:30:00) 3
2 (01/01/09 09:45:00) 7
3 (01/01/09 10:00:00) 19
4 (01/01/09 10:15:00) 18
5 (01/01/09 10:30:00) 12
While the problem is trivial to solve programatically, I wish to accomplish this for 210,000 intervals and over 1.2 million events. My current approach involves leveraging the data.table
package and the &
operator to check whether an interval lies between the start and end time of each event.
library(data.table)
events <- data.table(events)
data.frame(intervals, count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))
But considering the size of my data, this approach takes a very long time to run. Any advice on better alternatives to accomplish this in R?
Cheers.
Upvotes: 5
Views: 3014
Reputation: 179398
The secret of fast performing code in R is to keep everything in vector, or arrays, which are really just arrays in disguise.
Here is a solution that makes use exclusively of base R arrays. Your sample of data is tiny so I use replicate
and system.time
combined to measure performance.
My solution is roughly 6 times faster than your solution with sapply
and data.table
. (My solution takes 0.6 seconds to solve your small sample data set 1,000 times.)
Timing your solution
system.time(replicate(1000,
XX <- data.frame(
intervals,
count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))
))
user system elapsed
4.04 0.05 4.11
My solution. First create two helper functions to create equal sized arrays with events running down the columns and intervals running across the rows. Then do a simple vector comparison followed by colSums
:
event.array <- function(x, interval){
len <- length(interval)
matrix(rep(unclass(x), len), ncol=len)
}
intervals.array <- function(x, intervals){
len <- length(x)
matrix(rep(unclass(intervals), len), nrow=len, byrow=TRUE)
}
a.start <- event.array(start, intervals)
a.finish <- event.array(finish, intervals)
a.intervals <- intervals.array(start, intervals)
data.frame(intervals,
count=colSums(a.start <= a.intervals & a.finish >= a.intervals))
intervals count
1 (01/02/09 09:30:00) 3
2 (01/02/09 09:45:00) 7
3 (01/02/09 10:00:00) 19
4 (01/02/09 10:15:00) 18
5 (01/02/09 10:30:00) 12
Timing my solution
system.time(replicate(1000,
YY <- data.frame(
intervals,
count=colSums(a.start <= a.intervals & a.finish >= a.intervals))
))
user system elapsed
0.67 0.02 0.69
all.equal(XX, YY)
[1] TRUE
Upvotes: 3
Reputation: 20456
Perhaps using dim()
instead of sum()
and ldply()
instead of sapply()
might be faster?
b<-function(i,df){ data.frame(i, count=dim(df[with(df, start<i & finish> i),])[1])};
ldply(intervals, b, events);
i count
1 14246.40 3
2 14246.41 7
3 14246.42 19
4 14246.43 18
5 14246.44 12
I'm not familiar with the chron library so I did't make i
come out as a timestamp. Sorry.
Upvotes: 0