Vincent

Reputation: 71

Why does the following for-loop utilize all of the cores in my machine in R?

I have the following R code, in which parallelization is not explicitly enabled:

matrix <- matrix(rnorm(1000^2), ncol = 1000)
vec <- rnorm(1000)

for (i in 1:10000) {
  a <- sum(matrix %*% vec)
}

When I execute the for loop, my system monitor shows all cores running at 100%. My understanding was that for loops in R are always serial. I also notice that a single large matrix multiplication on its own utilizes only one core, so I don't believe the parallelization is happening in the matrix multiplication.

The larger problem here is that I've written an MCMC sampler that needs to run in serial as a Markov chain, but when I run the sampler, I see that all cores are being utilized. The code above is just a minimal working example. Should I be concerned that the MCMC sampler is not actually running in serial (i.e. as a Markov chain)?

I'm using R 3.5.2 inside the rocker/tidyverse:3.5.2 Docker container, and my local OS is Ubuntu 18.04.

Thanks for any help!

Here is my session info:

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0   

Upvotes: 0

Views: 148

Answers (1)

Vincent

Reputation: 71

Thanks for all of the helpful comments. It looks like BLAS uses multiple threads for matrix multiplication, and by default it was using all 12.

Interestingly, when reducing the number of BLAS threads via RhpcBLASctl::blas_set_num_threads(1), the total compute time actually decreases. See the results below for my machine with 12 logical processors:

# Check how many processors BLAS can see (12 on this machine)
RhpcBLASctl::blas_get_num_procs()

# Benchmark with BLAS using all 12 threads
RhpcBLASctl::blas_set_num_threads(12)

matrix <- matrix(rnorm(1000^2), ncol = 1000)
vec <- rnorm(1000)

system.time(
  for (i in 1:2000) {
    matrix1 <- matrix + 1
    a <- sum(matrix1 %*% vec)
  }
)

# Benchmark again with BLAS restricted to a single thread
RhpcBLASctl::blas_set_num_threads(1)

matrix <- matrix(rnorm(1000^2), ncol = 1000)
vec <- rnorm(1000)

system.time(
  for (i in 1:2000) {
    matrix1 <- matrix + 1
    a <- sum(matrix1 %*% vec)
  }
)

You'll see that it actually runs faster with only one thread (maybe because of the overhead of coordinating the threads?). For my MCMC sampler, I'll set the number of BLAS threads to 1, then make use of the other cores where parallel processing will actually improve compute times, i.e. running multiple chains in parallel, as sketched below.
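For reference, here's a minimal sketch of that setup, assuming RhpcBLASctl and the base parallel package are available. run_chain and the four-chain setup are just placeholders for illustration, not my actual sampler:

library(parallel)

# Hypothetical stand-in for the real MCMC sampler: each chain is
# serial internally, but separate chains are independent.
run_chain <- function(seed, n_iter = 1000) {
  set.seed(seed)
  # Keep BLAS single-threaded inside each worker so the chains
  # don't compete with each other for cores.
  RhpcBLASctl::blas_set_num_threads(1)
  draws <- numeric(n_iter)
  state <- 0
  for (i in seq_len(n_iter)) {
    state <- state + rnorm(1)  # placeholder update step
    draws[i] <- state
  }
  draws
}

# Run four independent chains in parallel, one core each
chains <- mclapply(1:4, run_chain, mc.cores = 4)

Note that mclapply forks the R process, so this works on Linux (e.g. inside the rocker container) but not on Windows.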

Upvotes: 3
