pgoetz

Reputation: 919

R probability simulation that won't terminate?

I'm teaching a statistics class where I'm having students explore questions in probability and statistics through simulation using R. Recently there was some confusion about the probability of getting exactly two 6's when rolling 5 dice. The answer is choose(5,2)*5^3/6^5, but some students were convinced that "order shouldn't matter"; i.e. that the answer should be choose(5,2)*choose(25,3)/choose(30,5).

I thought it would be fun to have them simulate rolling 5 dice thousands of times, keeping track of the empirical probability for each experiment, and then repeat the experiment many times. The problem is the two numbers above are sufficiently close that it's quite hard to get a simulation to tease out the difference in a statistically significant fashion (of course I could just be doing it wrong). I tried rolling 5 dice 100000 times, then repeating the experiment 10000 times. This took an hour or so to run on my i7 Linux machine and still allowed for a 25% chance that the correct answer is choose(5,2)*choose(25,3)/choose(30,5).

So I increased the number of dice rolls per experiment to 10^6. Now the code has been running for over 2 days and shows no sign of finishing. I'm confused by this, as I only increased the number of operations by an order of magnitude, implying that the run time should be closer to 10 hours.
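
For reference, the two candidate answers are easy to compute directly in R, and the size of the gap between them is what makes the simulation so demanding (the sample-size line at the end is only a rough illustrative estimate, not something from the original post):

ans1 <- choose(5,2) * 5^3 / 6^5                    # 0.160751
ans2 <- choose(5,2) * choose(25,3) / choose(30,5)  # 0.1613967
ans2 - ans1                                        # gap of roughly 0.00065

# Rough, illustrative estimate of how many 5-dice rolls are needed before the
# standard error of the estimated proportion shrinks to about the size of that gap
(sqrt(ans1 * (1 - ans1)) / (ans2 - ans1))^2        # on the order of 3e5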

Second question: Is there a better way to do this? See code posted below:

probdist = rep(0,10000)            # empirical probability from each experiment

for (j in 1:length(probdist))
{
   outcome = rep(0,1000000)        # 1 if a roll of 5 dice gives exactly two 6's
   for (k in 1:1000000)
   {
      rolls = sample(1:6, 5, replace=T)
      if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
   }

   probdist[j] = sum(outcome)/length(outcome)
}

Upvotes: 2

Views: 382

Answers (3)

Aaron - mostly inactive

Reputation: 37754

Vectorization is almost always preferred to any for loop. In this case, you should see substantial speedup by generating all your dice throws first, then checking how many in each group of five equal 6.

set.seed(5)
N <- 1e6
foo <- matrix(sample(1:6, 5*N, replace=TRUE), ncol=5)
p <- mean(rowSums(foo==6)==2)
se <- sqrt(p*(1-p)/N)
p
## [1] 0.160382

Here's a 95% confidence interval:

p + se*qnorm(0.975)*c(-1,1)
## [1] 0.1596628 0.1611012

We can see that the true answer (ans1) is in the interval but the false answer (ans2) is not; alternatively, we could perform significance tests. The p-value when testing the true answer is 0.31, but for the false answer it is 0.0057.

(ans1 <- choose(5,2)*5^3/6^5)
## [1] 0.160751
pnorm(abs((ans1-p)/se), lower=FALSE)*2
## [1] 0.3145898

(ans2 <- choose(5,2)*choose(25,3)/choose(30,5))
## [1] 0.1613967
pnorm(abs((ans2-p)/se), lower=FALSE)*2
## [1] 0.005689008

Note that I'm generating all the dice throws at once; if memory is an issue, you could split this up into pieces and combine, as you did in your original post. This is possibly what caused your unexpected slowdown: if it was necessary to use swap, that would slow things down substantially. If so, it's better to increase the number of times you run the loop, not the number of rolls within the loop.
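
A minimal sketch of that chunking idea, in case it helps (the chunk size and variable names here are illustrative, not part of the answer above):

N     <- 1e7    # total number of 5-dice rolls wanted
chunk <- 1e6    # rolls generated per pass, to keep the matrix small
hits  <- 0
for (i in seq_len(N / chunk)) {
  foo  <- matrix(sample(1:6, 5 * chunk, replace = TRUE), ncol = 5)
  hits <- hits + sum(rowSums(foo == 6) == 2)
}
p <- hits / N   # combined estimate of P(exactly two 6's)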

Upvotes: 2

pgoetz

Reputation: 919

I had originally awarded a correct answer check to M. Berk for his/her suggestion to use the R replicate() function. Further investigation has forced me to rescind my previous endorsement. It turns out that replicate() is just a wrapper for sapply(), which doesn't actually afford any performance benefit over a for loop (this seems to be a common misconception). In any case, I prepared 3 versions of the simulation, 2 using a for loop and one using replicate(), as suggested, and ran them one after the other, starting from a fresh R session each time, in order to compare the execution times:

# dice26dist1.r: For () loop version with unnecessary array allocation
probdist = rep(0,100)

for (j in 1:length(probdist))
{
  outcome = rep(0,1000000)
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome)/length(outcome)
}

system.time(source('dice26dist1.r'))
user system elapsed
596.365 0.240 598.614

# dice26dist2.r: For () loop version
probdist = rep(0,100)

for (j in 1:length(probdist))
{
  outcomes = 0
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcomes = outcomes + 1
  }
  probdist[j] = outcomes/1000000
}

system.time(source('dice26dist2.r'))
user system elapsed
506.331 0.076 508.104

# dice26dist3.r:  replicate() version
doSample <- function()
{
   sum(sample(1:6,size=5,replace=TRUE)==6)==2
}

probdist = rep(0,100)

for (j in 1:length(probdist))
{
  samples = replicate(n=1000000,expr=doSample())
  probdist[j] = mean(samples)
}

system.time(source('dice26dist3.r'))
user system elapsed
804.042 0.472 807.250

From this you can see that the replicate() version is considerably slower than either of the for loop versions by any system.time metric. I had originally thought that my problem was mostly due to cache misses from allocating the million-element outcome[] array, but comparing the times of dice26dist1.r and dice26dist2.r indicates that this has only a nominal impact on performance (although the impact on system time is considerable: a >300% difference).

One might argue that I'm still using for loops in all three simulations, but as far as I can tell this is completely unavoidable when simulating a random process; I have to simulate actually going through the random process (in this case, rolling 5 dice) every time. I would love to know about any technique that would allow me to avoid using a for loop (in a way that improves performance, of course). I understand that this problem would lend itself very effectively to parallelization, but I'm talking about using a single R session -- is there a way to make this faster?
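
For comparison, here is one way to keep the outer experiment loop but drop the inner per-roll loop entirely, using the same matrix/rowSums vectorization as in the answer above (a sketch only; the experiment count of 100 is chosen just to match the benchmark scripts):

probdist <- numeric(100)
for (j in seq_along(probdist)) {
  rolls       <- matrix(sample(1:6, 5 * 1e6, replace = TRUE), ncol = 5)
  probdist[j] <- mean(rowSums(rolls == 6) == 2)   # proportion with exactly two 6's
}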

Upvotes: 2

M. Berk

Reputation: 199

A good rule of thumb is to never, ever write a for loop in R. Here's an alternative solution:

doSample <- function()
{
   sum(sample(1:6,size=5,replace=TRUE)==6)==2
}

> system.time(samples <- replicate(n=10000,expr=doSample()))
user  system elapsed 
0.06    0.00    0.06 
> mean(samples)
[1] 0.1588
> choose(5,2)*5^3/6^5
[1] 0.160751

Doesn't seem to be too accurate with 10,000 samples. Better with 100,000:

> system.time(samples <- replicate(n=100000,expr=doSample()))
user  system elapsed 
0.61    0.02    0.61 
> mean(samples)
[1] 0.16135

Upvotes: 3
