How to vectorize comparisons instead of for-loop in R?

Question

I would like to run a discrete-time simulation (simplified version below). I generate a data frame of population members (one member per row) with their timestamps for entering and exiting a website. I then wish to count at each time interval how many members are on the site.

Currently I am looping through time and at each second counting how many members have entered and not yet exited. (I have also tried destructive iteration by removing exited members at each interval, which takes even longer. I also understand that I can use larger time intervals in the loop.)

How do I use linear algebra to eliminate the for-loop and excess runtime? My current approach does not scale well as population increases, and of course it is linear with respect to duration.

popSize = 10000
simDuration = 10000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp <- enterTimestamp + rexp(n = popSize, rate = .001)
popEvents <- data.frame(cbind(enterTimestamp,exitTimestamp))
visitorLoad <- integer(length = simDuration)
for (i in 1:simDuration) {
  visitorLoad[i] <- sum(popEvents$enterTimestamp <= i & 
                        popEvents$exitTimestamp > i)  
  if (i %% 100 == 0) {print(paste('Sim at',i,'out of',simDuration,
                      'seconds.',sep=' ') )}
}
plot(visitorLoad, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

josliber · Accepted Answer

You can obtain the counts of visitors entering and exiting at different times and then use the cumulative sum to compute the number of visitors there at a particular time. This seems to meet your requirement of the code running quickly, though it does not use linear algebra.

diffs = rep(0, simDuration+1)

# Store the number of times a visitor enters and exits at each timestep. The table
# will contain headers that are the timesteps and values that are the number of
# people entering or exiting at the timestep.
tabEnter = table(pmax(1, ceiling(enterTimestamp)))
tabExit = table(pmin(simDuration+1, ceiling(exitTimestamp)))

# For each time index, add the number of people entering and subtract the number of
# people exiting. For instance, if in period 20, 3 people entered and 4 exited, then
# diffs[20] equals -1. as.numeric(names(tabEnter)) is the periods for which at least
# one person entered, and tabEnter is the number of people in each of those periods.
diffs[as.numeric(names(tabEnter))] = diffs[as.numeric(names(tabEnter))] + tabEnter
diffs[as.numeric(names(tabExit))] = diffs[as.numeric(names(tabExit))] - tabExit

# cumsum() sums the diffs vector through a particular time point. 
visitorLoad2 = head(cumsum(diffs), simDuration)

How to vectorize comparisons instead of for-loop in R?

Answers (2)

Related Questions