Reputation: 13
I want to fill a matrix with data simulated by using a for-loop containing the rbinom-function. This loop executes the rbinom-function 100 times, thus generating a different outcome every run. However, I can't find a way to get these outcomes in a matrix for further analysis. When assigning the for loop to an object, this object appears empty in the environment and can thus not be used in the matrix. ('data' must be of a vector type, was 'NULL').
When not including the rbinom-function in a for loop, it can be assigned to an object and I'm able to use the output in the matrix. Every column, however, contains the exact same sequence of numbers. When only running the for loop containing the rbinom-function, I do get different sequences, as it runs the rbinom function 100 times instead of 1 time. I just don't know how to integrate the loop in te matrix.
The two pieces of code I have:
n = 100
size = 7
loop_vill <- for (i in 1:100) {
print(rbinom(n=n, size=size, prob=0.75)) #working for-loop
}
vill <- rbinom(n=n, size=size, prob=0.75)
sim_data_vill <- matrix(data=vill, nrow=length(loop_vill), ncol=100)
#creates a matrix in which all columns are exact copies, should be solved
when able to use outputs of loop_vill.
sim_data_vill
When calling sim_data_vill, it (logically) contains a matrix of 100 rows and 100 columns, with all columns being the same. However, I would like to see a matrix with all columns being different (thus containing the output of a new run of the rbinom-function each time).
Upvotes: 1
Views: 1968
Reputation: 8602
Hello as far as i can see you are having a few problems.
vill
)rbinom
Now there's a few ways to achieve what you want. (Scroll to the last example for the efficient way)
For loop
Using your idea we can use a for loop
. The best idea is to save an empty matrix first and fill it in with the for loop
nsim <- 100 #how many rbinom are w
n <- 100000
size = 7
prob = 0.75
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
Now this will work, but is a bit slow, and requires a whopping 3 lines of code for the simulation part!
apply
We can remove the for loop and pre-assigned matrix with using one of the myriad apply
like functions. One such function is replicate
. This reduces the massive 3 lines of code to:
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
wuh.. That was short, but can we do even better? Actually running functions such as rbinom
multiple times can be rather slow and costly.
One thing you will hear whispered (or shouted) is the word vectorized
, when it comes to programming in R. Basically, calling a function will induce overhead, and if you are working with a vectorized function, calling it once, will make sure you only induce overhead once, instead of multiple times. All distribution functions in R such as rbinom
are vectorized. So what if we just do all the simulation in one go?
sim_data_vill_vectorized_functions <- matrix(rbinom(nsim * n, size, prob), ncol = nsim, nrow = n, byrow = FALSE) #perform all simulations in 1 rbinom call, and fill in 1 matrix.
So lets just quickly check how much faster this is compared to using a for loop
or apply
. This can be done using the microbenchmark package:
library(microbenchmark)
microbenchmark(for_loop = {
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
},
apply = {
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
},
vectorized = {
sim_data_vill_vectorized <- matrix(rbinom(nsim * n, size = size, prob = prob), ncol = nsim, nrow = n, byrow = FALSE)
}
)
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 751.6121 792.5585 837.5512 812.7034 848.2479 1058.4144 100
apply 752.4156 781.3419 837.5626 803.7456 901.6601 1154.0365 100
vectorized 696.9429 720.2255 757.7248 737.6323 765.3453 921.3982 100
Looking at the median time, running all the simulations at once is about 60 ms. faster than using a for loop. As such here it is not a big deal, but in other cases it might be. (reverse n
and nsim
, and you will start seeing the overhead becoming big part of the calculations.)
Even if it is not a big deal, using vectorized computations where they pop up, is in all cases prefered, to make code more readable, and to avoid unnecessary bottlenecks that have already been optimized in implemented code.
Upvotes: 1