Reputation: 19005
I would like to run a discrete-time simulation (simplified version below). I generate a data frame of population members (one member per row) with their timestamps for entering and exiting a website. I then wish to count at each time interval how many members are on the site.
Currently I am looping through time and at each second counting how many members have entered and not yet exited. (I have also tried destructive iteration by removing exited members at each interval, which takes even longer. I also understand that I can use larger time intervals in the loop.)
How do I use linear algebra to eliminate the for-loop and excess runtime? My current approach does not scale well as population increases, and of course it is linear with respect to duration.
popSize = 10000
simDuration = 10000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp <- enterTimestamp + rexp(n = popSize, rate = .001)
popEvents <- data.frame(cbind(enterTimestamp,exitTimestamp))
visitorLoad <- integer(length = simDuration)
for (i in 1:simDuration) {
visitorLoad[i] <- sum(popEvents$enterTimestamp <= i &
popEvents$exitTimestamp > i)
if (i %% 100 == 0) {print(paste('Sim at',i,'out of',simDuration,
'seconds.',sep=' ') )}
}
plot(visitorLoad, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
Upvotes: 2
Views: 218
Reputation: 8691
How about this for simplicity:
vl<-unlist(lapply(1:simDuration,function(i)sum((enterTimestamp<=i)*(exitTimestamp>i))))
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
It's twice as fast as current loop, but if performance is more important then @josilber 's solution is better, or maybe something with data.table()
, will have a think...
EDIT - how about this for speed:
require(data.table)
require(plyr) # for count() function
system.time({
enter<-data.table(count(ceiling(enterTimestamp))) # entries grouped by second
exit<-data.table(count(ceiling(exitTimestamp))) # exits grouped by second
sim<-data.table(x=1:simDuration) # time index
merged<-merge(merge(sim,enter,by="x",all.x=T),exit,by="x",all.x=T)
mat<-data.matrix(merged[,list(freq.x,freq.y)]) # make matrix to remove NAs
mat[is.na(mat)]<-0 # remove NAs, there are quicker ways but more complicated
vl<-cumsum(mat[,1]-mat[,2]) # cumsum() to roll up the movements
})
user system elapsed
0.02 0.00 0.02
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
** FURTHER EDIT ** - balance of performance and simplicity
system.time(cumsum(data.frame(table(cut(enterTimestamp,0:10000))-table(cut(exitTimestamp,0:10000)))[,2]))
user system elapsed
0.09 0.00 0.10
Upvotes: 1
Reputation: 44330
You can obtain the counts of visitors entering and exiting at different times and then use the cumulative sum to compute the number of visitors there at a particular time. This seems to meet your requirement of the code running quickly, though it does not use linear algebra.
diffs = rep(0, simDuration+1)
# Store the number of times a visitor enters and exits at each timestep. The table
# will contain headers that are the timesteps and values that are the number of
# people entering or exiting at the timestep.
tabEnter = table(pmax(1, ceiling(enterTimestamp)))
tabExit = table(pmin(simDuration+1, ceiling(exitTimestamp)))
# For each time index, add the number of people entering and subtract the number of
# people exiting. For instance, if in period 20, 3 people entered and 4 exited, then
# diffs[20] equals -1. as.numeric(names(tabEnter)) is the periods for which at least
# one person entered, and tabEnter is the number of people in each of those periods.
diffs[as.numeric(names(tabEnter))] = diffs[as.numeric(names(tabEnter))] + tabEnter
diffs[as.numeric(names(tabExit))] = diffs[as.numeric(names(tabExit))] - tabExit
# cumsum() sums the diffs vector through a particular time point.
visitorLoad2 = head(cumsum(diffs), simDuration)
Upvotes: 4