Reputation: 1277
I need to generate simulated data where the percent censored cannot be 0 or 1. That's why I use while loop. The problem is if I increase count to 10,000 (instead of 5), the program is very slow. I have to repeat this with 400 different scenarios so it is extremely slow. I'm trying to figure out places where I can vectorize my code piece by piece. How can I avoid while-loop and still able to keep the condition?
Another approach is keep the while loop and generate a list of 10,000 dataset that meet my criteria and then apply the function to the list. Here I use summary function as an example but my real function use both X_after and delta (ie. mle(X_after,delta)). Is this a better option if I have to use while loop?
Another concern I have is memory issue. How can I avoid using up memory while doing such large simulation?
mu=1 ; sigma=3 ; n=10 ; p=0.10
dset <- function (mu,sigma, n, p) {
Mean <- array()
Median <- array()
Pct_cens_array <- array()
count = 0
while(count < 5) {
lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
X_before <- rlnorm(n, log(mu), log(sigma))
X_after <- ifelse(X_before <= lod, lod, X_before)
delta <- ifelse(X_before <= lod, 1, 0)
pct_cens <- sum(delta)/length(delta)
# print(pct_cens)
if (pct_cens == 0 | pct_cens == 1 ) next
else {
count <- count +1
if (pct_cens > 0 & pct_cens < 1) {
sumStats <- summary(X_after)
Median[count] <- sumStats[3]
Mean [count]<- sumStats[4]
Pct_cens_array [count] <- pct_cens
print(list(pct_cens=pct_cens,X_after=X_after, delta=delta, Median=Median,Mean=Mean,Pct_cens_array=Pct_cens_array))
}
}
}
return(data.frame(Pct_cens_array=Pct_cens_array, Mean=Mean, Median=Median))
}
Upvotes: 0
Views: 1198
Reputation: 121057
I've made a few little tweaks to your code without changing the whole style of it. It would be good to heed Yoong Kim's advice and try to break up the code into smaller pieces, to make it more readable and maintainable.
Your function now gets two "n" arguments, for how many samples you have in each row, and how many iterations (columns) you want.
You were growing the arrays Median
and Mean
in the loop, which requires a lot of messing about reallocating memory and copying things, which slows everything down. I've predefined X_after
and moved the mean and median calculations after the loop to avoid this. (As a bonus, mean
and median
only get called once instead of n_iteration
times.)
The calls to ifelse
weren't really needed.
It is a little quicker to call rlnorm
once, generating enough values for x and the lod, than to call it twice.
Here's the updated function.
dset2 <- function (mu, sigma, n_samples, n_iterations, p) {
X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
pct_cens <- numeric(n_iterations)
count <- 1
while(count <= n_iterations) {
random_values <- rlnorm(2L * n_samples, log(mu), log(sigma))
lod <- quantile(random_values[1:n_samples], p = p)
X_before <- random_values[(n_samples + 1L):(2L * n_samples)]
X_after[count, ] <- pmax(X_before, lod)
delta <- X_before <= lod
pct_cens[count] <- mean(delta)
if (pct_cens > 0 && pct_cens < 1 ) count <- count + 1
}
Median <- apply(X_after, 1, median)
Mean <- rowMeans(X_after)
data.frame(Pct_cens_array=pct_cens, Mean=Mean, Median=Median)
}
Compare timings with, for example,
mu=1
sigma=3
n_samples=10L
n_iterations = 1000L
p=0.10
system.time(dset(mu,sigma, n_samples, n_iterations, p))
system.time(dset2(mu,sigma, n_samples, n_iterations, p))
On my machine, there is a factor of 3 speedup.
Upvotes: 2
Reputation: 310
First rule I learnt with C programming: divide to reign! I mean you should first create multiple functions and call them into your loop because this loop does too many different things. And I am worried about your algorithm:
if (pct_cens == 0 | pct_cens == 1 ) next
else {count <- count +1
Is there any reason you use while instead of for? There is a difference between the loops while and for: with while, you always have a first loop, not with for.
Finally, about your problem: use more memory with an array to increase the speed. Example:
lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
X_before <- rlnorm(n, log(mu), log(sigma))
log(mu) and log(sigma) are computed twice: use variables to store the result, you will save time but spend more memory of course.
Upvotes: 2