JvanWijk
JvanWijk

Reputation: 13

Filling a matrix with for-loop output

I want to fill a matrix with data simulated by using a for-loop containing the rbinom-function. This loop executes the rbinom-function 100 times, thus generating a different outcome every run. However, I can't find a way to get these outcomes in a matrix for further analysis. When assigning the for loop to an object, this object appears empty in the environment and can thus not be used in the matrix. ('data' must be of a vector type, was 'NULL').

When not including the rbinom-function in a for loop, it can be assigned to an object and I'm able to use the output in the matrix. Every column, however, contains the exact same sequence of numbers. When only running the for loop containing the rbinom-function, I do get different sequences, as it runs the rbinom function 100 times instead of 1 time. I just don't know how to integrate the loop in te matrix.

The two pieces of code I have:

n = 100                                                               
size = 7
loop_vill <- for (i in 1:100) {
  print(rbinom(n=n, size=size, prob=0.75))        #working for-loop                 
}


vill <- rbinom(n=n, size=size, prob=0.75)


sim_data_vill <- matrix(data=vill, nrow=length(loop_vill), ncol=100)   
#creates a matrix in which all columns are exact copies, should be solved 
when able to use outputs of loop_vill.

sim_data_vill

When calling sim_data_vill, it (logically) contains a matrix of 100 rows and 100 columns, with all columns being the same. However, I would like to see a matrix with all columns being different (thus containing the output of a new run of the rbinom-function each time).

Upvotes: 1

Views: 1968

Answers (1)

Oliver
Oliver

Reputation: 8602

Hello as far as i can see you are having a few problems.

  1. You are currently not running the for loop for each column (only the 1 vector is saved in vill)
  2. You are not looping over the rbinom

Now there's a few ways to achieve what you want. (Scroll to the last example for the efficient way)

method 1: For loop

Using your idea we can use a for loop. The best idea is to save an empty matrix first and fill it in with the for loop

nsim <- 100 #how many rbinom are w
n <- 100000
size = 7
prob = 0.75
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
    sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time

Now this will work, but is a bit slow, and requires a whopping 3 lines of code for the simulation part!

method 2: apply

We can remove the for loop and pre-assigned matrix with using one of the myriad apply like functions. One such function is replicate. This reduces the massive 3 lines of code to:

sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))

wuh.. That was short, but can we do even better? Actually running functions such as rbinom multiple times can be rather slow and costly.

method 3: using vectorized functions (very fast)

One thing you will hear whispered (or shouted) is the word vectorized, when it comes to programming in R. Basically, calling a function will induce overhead, and if you are working with a vectorized function, calling it once, will make sure you only induce overhead once, instead of multiple times. All distribution functions in R such as rbinom are vectorized. So what if we just do all the simulation in one go?

sim_data_vill_vectorized_functions <- matrix(rbinom(nsim * n, size, prob), ncol = nsim, nrow = n, byrow = FALSE) #perform all simulations in 1 rbinom call, and fill in 1 matrix.

So lets just quickly check how much faster this is compared to using a for loop or apply. This can be done using the microbenchmark package:

library(microbenchmark)
microbenchmark(for_loop = {
    sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
    for(i in seq(nsim)) #iterate from 1 to nsim
        sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
},
    apply = {
        sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
    },
    vectorized = {
        sim_data_vill_vectorized <- matrix(rbinom(nsim * n, size = size, prob = prob), ncol = nsim, nrow = n, byrow = FALSE)
    }
)

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
   for_loop 751.6121 792.5585 837.5512 812.7034 848.2479 1058.4144   100
      apply 752.4156 781.3419 837.5626 803.7456 901.6601 1154.0365   100
 vectorized 696.9429 720.2255 757.7248 737.6323 765.3453  921.3982   100

Looking at the median time, running all the simulations at once is about 60 ms. faster than using a for loop. As such here it is not a big deal, but in other cases it might be. (reverse n and nsim, and you will start seeing the overhead becoming big part of the calculations.)

Even if it is not a big deal, using vectorized computations where they pop up, is in all cases prefered, to make code more readable, and to avoid unnecessary bottlenecks that have already been optimized in implemented code.

Upvotes: 1

Related Questions