Reputation: 99
Imagine I hand you a ping pong ball with a "-1" printed on it. I then tell you to draw another ping pong ball from a bag marked "First Bag". This bag has 30,000 balls in it, some marked with a "-1", some with "0", and some with "+1". Whichever ball you draw, you add its number to your current "score" of -1. E.g., if you draw a -1, your new score is -2.
So long as your new score is below zero, you draw again from the first bag and add to your score again. But if and when your score gets to zero or higher, you then draw from a Second Bag, which has a different composition of -1s, 0s, and +1s.
I want you to draw a total of 1000 ping pong balls from the appropriate bags (i.e., depending on whether your current score is below zero or not) and then write down your total (cumulative) score at the end of the "set". Then I want you to repeat this experiment a million times and tell me in what percentage of the sets you ended up with a score higher than zero.
Is there a faster/more efficient way to code this? It's difficult to vectorize the loop since the draws are not independent, though maybe I can use some combo of ifelse and filter? I suspect it's the replicating that's the expensive part though.
ptm <- proc.time()
###First bag
n=30000
s=155
f=255
z=n-s-f
first_bag=c(rep(0,z), rep(1,s), rep(-1,f))
###Second bag
n2=30000
s2=275
f2=285
z2=n2-s2-f2
second_bag=c(rep(0,z2), rep(1,s2), rep(-1,f2))
###Simulate draws
sim_draw = function(draws) {
  score = -1
  for (i in 1:draws) {
    if (score < 0) {
      score = score + sample(first_bag, 1, replace=TRUE)
    } else {
      score = score + sample(second_bag, 1, replace=TRUE)
    }
  }
  score
}
###Repeat sims and find area above zero
samp_distribution=replicate(1000000, sim_draw(1000))
mean(samp_distribution>0)
print(proc.time() - ptm)
Upvotes: 2
Views: 1270
Reputation: 89057
A few ideas:
Learn to use the profiler to see where your implementation is wasting time. See the example at the bottom of ?summaryRprof for how to use it; just replace example(glm) with your code: samp_distribution <- replicate(1000, sim_draw(1000)).
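A minimal profiling sketch along those lines (the file name "sim_profile.out" is just a placeholder, and only 1000 replications are used to keep the profiling run short):
Rprof("sim_profile.out")
samp_distribution <- replicate(1000, sim_draw(1000))
Rprof(NULL)
summaryRprof("sim_profile.out")$by.self  # self-time spent in each function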
You will notice that a lot of time is wasted calling sample over and over. So the first improvement to your code can be to call sample only a couple of times:
sim_draw_1 <- function(draws, bag1 = first_bag, bag2 = second_bag) {
  # draw everything up front, then walk through the pre-drawn vectors
  s1 <- sample(bag1, draws, replace = TRUE)
  s2 <- sample(bag2, draws, replace = TRUE)
  score <- -1
  for (i in 1:draws)
    score <- score + if (score < 0) s1[i] else s2[i]
  score
}
You can see that this is nearly ten times faster (I find the microbenchmark package to be a more reliable way of measuring/comparing computation times than proc.time):
library(microbenchmark)
microbenchmark(sim_draw(1000), sim_draw_1(1000),
               times = 1000)
# Unit: microseconds
# expr min lq median uq max neval
# sim_draw(1000) 5518.758 5845.465 6036.1375 6340.662 53023.483 1000
# sim_draw_1(1000) 690.796 730.292 743.8435 785.071 8248.163 1000
For very iterative code like yours, it is always worth trying the compiler:
library(compiler)
sim_draw_2 <- cmpfun(sim_draw_1)
library(microbenchmark)
microbenchmark(sim_draw_1(1000), sim_draw_2(1000), times = 1000)
# Unit: microseconds
# expr min lq median uq max neval
# sim_draw_1(1000) 684.687 717.6640 748.3305 812.971 9412.936 1000
# sim_draw_2(1000) 242.895 259.8125 268.3925 294.343 1710.290 1000
Another 3x improvement, not bad.
Last, since the biggest bottleneck inside the function is still the for loop, you can try to rewrite it so that, instead of processing one outcome at a time, you use vectorized (fast) functions to process as many outcomes as possible at once (exactly the number of draws it takes before you have to switch bags):
sim_draw_3 <- function(draws, bag1 = first_bag, bag2 = second_bag) {
  s1 <- sample(bag1, draws, replace = TRUE)
  s2 <- sample(bag2, draws, replace = TRUE)
  score <- -1L
  idx <- 1L
  while (idx <= draws) {
    # pick the bag for the current sign of the score, and the score at
    # which we would have to switch bags (0 while negative, -1 otherwise)
    bag <- if (score < 0) s1 else s2
    switch.at <- if (score < 0) 0L else -1L
    next.draws <- bag[idx:draws]
    next.scores <- score + cumsum(next.draws)
    # consume all draws up to the first bag switch in one vectorized step
    stop.idx <- which(next.scores == switch.at)[1]
    if (is.na(stop.idx)) stop.idx <- length(next.draws)
    score <- next.scores[stop.idx]
    idx <- idx + stop.idx
  }
  score
}
sim_draw_4 <- cmpfun(sim_draw_3)
microbenchmark(sim_draw_2(1000), sim_draw_3(1000), sim_draw_4(1000), times = 1000)
# Unit: microseconds
# expr min lq median uq max neval
# sim_draw_2(1000) 236.916 252.540 269.1355 293.7775 7819.841 1000
# sim_draw_3(1000) 80.527 95.185 128.9840 162.7790 625.986 1000
# sim_draw_4(1000) 79.486 92.378 123.5535 162.5085 518.594 1000
Another 2x improvement. Here you see that the compiler only gets us a marginal improvement because the number of iterations has dropped dramatically, and everything else in our R code is using very efficient (vectorized) functions.
So we got from 5845 microseconds to 124 per function call, a pretty good improvement. If this is still too slow, then you will probably have to switch to C++ (via Rcpp, for example). At least I hope this helped show you some useful tricks.
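For what it's worth, here is a rough Rcpp sketch of that last step (my own illustration, not benchmarked above; sim_draw_cpp and sim_draw_5 are made-up names). The scoring loop moves to C++ while the draws are still made in R:
library(Rcpp)
cppFunction('
  int sim_draw_cpp(IntegerVector s1, IntegerVector s2) {
    // same logic as sim_draw_1: walk the pre-drawn vectors once
    int score = -1;
    int n = s1.size();
    for (int i = 0; i < n; ++i)
      score += (score < 0) ? s1[i] : s2[i];
    return score;
  }
')
sim_draw_5 <- function(draws, bag1 = first_bag, bag2 = second_bag)
  sim_draw_cpp(as.integer(sample(bag1, draws, replace = TRUE)),
               as.integer(sample(bag2, draws, replace = TRUE)))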
Last but not least, I would mention that since your function calls are all independent, you could look into running them in parallel. I'll point you to http://cran.r-project.org/web/views/HighPerformanceComputing.html and encourage you to search around.
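As a sketch of that idea with the parallel package (the core count is just a placeholder, and it assumes the bags and sim_draw_4 are already defined; on Windows you would use a PSOCK cluster with parLapply instead of forking):
library(parallel)
# For reproducible parallel streams, pick the L'Ecuyer RNG first.
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
samp_distribution <- unlist(mclapply(seq_len(1e6),
                                     function(i) sim_draw_4(1000),
                                     mc.cores = 4))
mean(samp_distribution > 0)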
Upvotes: 5