Amateur

Reputation: 1277

How to recode while loop to optimize performance for large simulation in R?

I need to generate simulated data where the percent censored cannot be 0 or 1. That's why I use a while loop. The problem is that if I increase count to 10,000 (instead of 5), the program is very slow. I have to repeat this with 400 different scenarios, so it is extremely slow. I'm trying to figure out places where I can vectorize my code piece by piece. How can I avoid the while loop and still keep the condition?

Another approach is to keep the while loop and generate a list of 10,000 datasets that meet my criteria, then apply the function to the list. Here I use the summary function as an example, but my real function uses both X_after and delta (i.e. mle(X_after, delta)). Is this a better option if I have to use a while loop?
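
For example, something like this, where gen_valid is just a placeholder name for a helper that retries until the condition holds, and mle stands in for my real function:

# Placeholder helper: draw one dataset, retrying until 0 < pct_cens < 1
gen_valid <- function(mu, sigma, n, p) {
  repeat {
    lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
    X_before <- rlnorm(n, log(mu), log(sigma))
    delta <- ifelse(X_before <= lod, 1, 0)
    pct_cens <- sum(delta) / length(delta)
    if (pct_cens > 0 & pct_cens < 1)
      return(list(X_after = ifelse(X_before <= lod, lod, X_before), delta = delta))
  }
}

# Build the list once, then apply the function to every element
datasets <- replicate(10000, gen_valid(mu, sigma, n, p), simplify = FALSE)
results <- lapply(datasets, function(d) summary(d$X_after))
# with my real function: lapply(datasets, function(d) mle(d$X_after, d$delta))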

Another concern is memory. How can I avoid using up memory while doing such a large simulation?

mu = 1; sigma = 3; n = 10; p = 0.10

dset <- function(mu, sigma, n, p) {
  Mean <- array()
  Median <- array()
  Pct_cens_array <- array()
  count <- 0
  while (count < 5) {
    lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
    X_before <- rlnorm(n, log(mu), log(sigma))
    X_after <- ifelse(X_before <= lod, lod, X_before)
    delta <- ifelse(X_before <= lod, 1, 0)
    pct_cens <- sum(delta) / length(delta)
    # print(pct_cens)
    if (pct_cens == 0 | pct_cens == 1) next
    else {
      count <- count + 1
      if (pct_cens > 0 & pct_cens < 1) {
        sumStats <- summary(X_after)
        Median[count] <- sumStats[3]
        Mean[count] <- sumStats[4]
        Pct_cens_array[count] <- pct_cens
        print(list(pct_cens = pct_cens, X_after = X_after, delta = delta,
                   Median = Median, Mean = Mean, Pct_cens_array = Pct_cens_array))
      }
    }
  }
  return(data.frame(Pct_cens_array = Pct_cens_array, Mean = Mean, Median = Median))
}

Upvotes: 0

Views: 1198

Answers (2)

Richie Cotton

Reputation: 121057

I've made a few little tweaks to your code without changing the whole style of it. It would be good to heed Yoong Kim's advice and try to break up the code into smaller pieces, to make it more readable and maintainable.

  • Your function now gets two "n" arguments, for how many samples you have in each iteration (columns), and how many iterations (rows) you want.

  • You were growing the arrays Median and Mean inside the loop, which forces R to repeatedly reallocate memory and copy data, slowing everything down (see the toy comparison after this list). I've preallocated X_after and moved the mean and median calculations after the loop to avoid this. (As a bonus, the per-iteration summary call is gone, and rowMeans computes all the means in a single vectorized call.)

  • The calls to ifelse weren't really needed: pmax handles the substitution, and a plain comparison gives delta.

  • It is a little quicker to call rlnorm once, generating enough values for x and the lod, than to call it twice.
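
To see the cost of growing, here's a toy comparison (just an illustration, separate from the function below):

m <- 1e5
system.time({x <- numeric(0); for (i in 1:m) x[i] <- i})  # grown element by element
system.time({x <- numeric(m); for (i in 1:m) x[i] <- i})  # preallocated up front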

Here's the updated function.

dset2 <- function(mu, sigma, n_samples, n_iterations, p) {
  X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
  pct_cens <- numeric(n_iterations)
  count <- 1
  while (count <= n_iterations) {
    # One call to rlnorm provides values for both the lod and X_before
    random_values <- rlnorm(2L * n_samples, log(mu), log(sigma))
    lod <- quantile(random_values[1:n_samples], probs = p)
    X_before <- random_values[(n_samples + 1L):(2L * n_samples)]
    X_after[count, ] <- pmax(X_before, lod)
    delta <- X_before <= lod
    pct_cens[count] <- mean(delta)
    # Only advance when censoring is strictly between 0 and 1;
    # otherwise the row is overwritten on the next pass
    if (pct_cens[count] > 0 && pct_cens[count] < 1) count <- count + 1
  }

  Median <- apply(X_after, 1, median)
  Mean <- rowMeans(X_after)
  data.frame(Pct_cens_array = pct_cens, Mean = Mean, Median = Median)
}

Compare timings with, for example (after adapting dset to take an iteration count in place of its hard-coded 5):

mu <- 1
sigma <- 3
n_samples <- 10L
n_iterations <- 1000L
p <- 0.10
system.time(dset(mu, sigma, n_samples, n_iterations, p))
system.time(dset2(mu, sigma, n_samples, n_iterations, p))

On my machine, there is a factor of 3 speedup.

Upvotes: 2

Yoong Kim

Reputation: 310

First rule I learnt with C programming: divide and conquer! I mean you should break this loop up into multiple functions and call them from the loop, because it does too many different things. And I am worried about your algorithm:

if (pct_cens == 0 | pct_cens == 1) next
else {
  count <- count + 1

Is there any reason you use while instead of for? There is a difference between the two loops: with for, the number of iterations is fixed in advance; with while, it depends on a condition, so you cannot know beforehand how many times it will run.
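
For example, a rough sketch of that split (function names like draw_sample and is_valid are just illustrative):

draw_sample <- function(mu, sigma, n, p) {
  log_mu <- log(mu)        # computed once per draw instead of twice
  log_sigma <- log(sigma)
  lod <- quantile(rlnorm(100000, log_mu, log_sigma), p = p)
  X_before <- rlnorm(n, log_mu, log_sigma)
  list(X_after = pmax(X_before, lod), delta = as.integer(X_before <= lod))
}

is_valid <- function(s) {
  pct <- mean(s$delta)
  pct > 0 && pct < 1
}

# A for loop makes the number of accepted draws explicit
results <- vector("list", 5)
for (i in seq_along(results)) {
  s <- draw_sample(mu, sigma, n, p)
  while (!is_valid(s)) s <- draw_sample(mu, sigma, n, p)
  results[[i]] <- s
}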

Finally, about your problem: you can trade memory for speed by storing intermediate results in variables. Example:

lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
X_before <- rlnorm(n, log(mu), log(sigma))

log(mu) and log(sigma) are each computed twice: store the results in variables and you will save time, at the cost of a little more memory of course.
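
For example, storing them first:

log_mu <- log(mu)        # computed once, reused twice
log_sigma <- log(sigma)
lod <- quantile(rlnorm(100000, log_mu, log_sigma), p = p)
X_before <- rlnorm(n, log_mu, log_sigma)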

Upvotes: 2
