C8H10N4O2
C8H10N4O2

Reputation: 19005

How to vectorize comparisons instead of for-loop in R?

I would like to run a discrete-time simulation (simplified version below). I generate a data frame of population members (one member per row) with their timestamps for entering and exiting a website. I then wish to count at each time interval how many members are on the site.

Currently I am looping through time and at each second counting how many members have entered and not yet exited. (I have also tried destructive iteration by removing exited members at each interval, which takes even longer. I also understand that I can use larger time intervals in the loop.)

How do I use linear algebra to eliminate the for-loop and excess runtime? My current approach does not scale well as population increases, and of course it is linear with respect to duration.

popSize = 10000
simDuration = 10000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp <- enterTimestamp + rexp(n = popSize, rate = .001)
popEvents <- data.frame(cbind(enterTimestamp,exitTimestamp))
visitorLoad <- integer(length = simDuration)
for (i in 1:simDuration) {
  visitorLoad[i] <- sum(popEvents$enterTimestamp <= i & 
                        popEvents$exitTimestamp > i)  
  if (i %% 100 == 0) {print(paste('Sim at',i,'out of',simDuration,
                      'seconds.',sep=' ') )}
}
plot(visitorLoad, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

Upvotes: 2

Views: 218

Answers (2)

Troy
Troy

Reputation: 8691

How about this for simplicity:

vl<-unlist(lapply(1:simDuration,function(i)sum((enterTimestamp<=i)*(exitTimestamp>i))))
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

It's twice as fast as current loop, but if performance is more important then @josilber 's solution is better, or maybe something with data.table(), will have a think...

EDIT - how about this for speed:

require(data.table)
require(plyr) # for count() function

system.time({

enter<-data.table(count(ceiling(enterTimestamp))) # entries grouped by second
exit<-data.table(count(ceiling(exitTimestamp)))   # exits grouped by second
sim<-data.table(x=1:simDuration)                  # time index
merged<-merge(merge(sim,enter,by="x",all.x=T),exit,by="x",all.x=T)
mat<-data.matrix(merged[,list(freq.x,freq.y)])    # make matrix to remove NAs
mat[is.na(mat)]<-0                                # remove NAs, there are quicker ways but more complicated
vl<-cumsum(mat[,1]-mat[,2])                       # cumsum() to roll up the movements

})

user  system elapsed 
0.02    0.00    0.02 

plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

** FURTHER EDIT ** - balance of performance and simplicity

system.time(cumsum(data.frame(table(cut(enterTimestamp,0:10000))-table(cut(exitTimestamp,0:10000)))[,2]))
user  system elapsed 
0.09    0.00    0.10

Upvotes: 1

josliber
josliber

Reputation: 44330

You can obtain the counts of visitors entering and exiting at different times and then use the cumulative sum to compute the number of visitors there at a particular time. This seems to meet your requirement of the code running quickly, though it does not use linear algebra.

diffs = rep(0, simDuration+1)

# Store the number of times a visitor enters and exits at each timestep. The table
# will contain headers that are the timesteps and values that are the number of
# people entering or exiting at the timestep.
tabEnter = table(pmax(1, ceiling(enterTimestamp)))
tabExit = table(pmin(simDuration+1, ceiling(exitTimestamp)))

# For each time index, add the number of people entering and subtract the number of
# people exiting. For instance, if in period 20, 3 people entered and 4 exited, then
# diffs[20] equals -1. as.numeric(names(tabEnter)) is the periods for which at least
# one person entered, and tabEnter is the number of people in each of those periods.
diffs[as.numeric(names(tabEnter))] = diffs[as.numeric(names(tabEnter))] + tabEnter
diffs[as.numeric(names(tabExit))] = diffs[as.numeric(names(tabExit))] - tabExit

# cumsum() sums the diffs vector through a particular time point. 
visitorLoad2 = head(cumsum(diffs), simDuration)

Upvotes: 4

Related Questions