kartoffelsalat

Reputation: 126

How to control the number of CPUs used by R?

I'm using the R package crossmatch, which itself relies on some other R packages (survival, nbpMatching, MASS) that in turn import a wide range of further dependencies. The crossmatch package implements a statistical test on a (potentially) large matrix that I need to compute very often (within an MCMC algorithm). I've written the following wrapper that performs some preprocessing steps before the actual test is run (the call to crossmatchtest() in the last line):

# wrapper function to directly call the crossmatch test with a single matrix
# first column of the matrix must be a binary group indicator, following columns are observations
# code is modified from the documentation of the crossmatch package
crossmatchdata <- function(dat) {

  # the grouping variable should be in the first column
  z <- dat[, 1]
  X <- subset(dat, select = -1)

  ## Rank based Mahalanobis distance between each pair:
  # X <- as.matrix(X)
  n <- dim(X)[1]
  k <- dim(X)[2]

  for (j in 1:k) {
    X[, j] <- rank(X[, j])
  }

  cv <- cov(X)
  vuntied <- var(1:n)
  rat <- sqrt(vuntied / diag(cv))

  cv <- diag(rat) %*% cv %*% diag(rat)
  out <- matrix(NA, n, n)

  icov <- ginv(cv)  # generalized (pseudo-)inverse from MASS
  for (i in 1:n) {
    out[i, ] <- mahalanobis(X, X[i, ], icov, inverted = TRUE)
  }

  dis <- out

  ## The cross-match test:
  return(crossmatchtest(z, dis))
}

I've noticed that if the matrix is rather small, this test uses only one CPU:

library(MASS)
library(crossmatch)
source("theCodeFromAbove.R")
# create a dummy matrix
m = cbind(c(rep(0, 100), rep(1, 100)))
m = cbind(m, matrix(runif(2000), ncol = 10, nrow = 200, byrow = TRUE))
while(TRUE) { crossmatchdata(m) }

as monitored via htop. However, if I increase the size of the matrix, R uses as many cores as are available (or at least it looks that way):

# create a dummy matrix
m = cbind(c(rep(0, 1000), rep(1, 1000)))
m = cbind(m, matrix(runif(2000000), ncol = 1000, nrow = 2000, byrow = TRUE))
while(TRUE) { crossmatchdata(m) }

I'm fine with this parallelization in general, but I would like to manually control the number of cores the R process uses. I've tried options(mc.cores = 4) without success.

Is there any other variable I could set? Or what's the best way of finding the package that's responsible for the use of more than one core?

Upvotes: 5

Views: 619

Answers (1)

Roland

Reputation: 132576

Let's look at the dependencies:

library(miniCRAN)

# build the dependency graph of crossmatch, excluding Suggests and Enhances
tags <- "crossmatch"
dg <- makeDepGraph(tags, enhances = FALSE, suggests = FALSE)
set.seed(1)  # fix the graph layout
plot(dg, legendPosition = c(-1, 1), vertex.size = 20)

[resulting plot of the crossmatch dependency graph]

That is quite a few dependencies. At first glance, there is no package for R-level parallelization among them (which also explains why options(mc.cores = 4) has no effect: that option only sets the default number of workers for functions from the parallel package). That leaves the possibility of packages parallelizing via compiled code. One such package is data.table (there might be others); try whether setDTthreads(1) turns off the parallelization, as in the sketch below.
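
For example (getDTthreads() and setDTthreads() are data.table's own thread controls):

library(data.table)
getDTthreads()   # number of threads data.table's compiled code currently uses
setDTthreads(1)  # restrict data.table to a single thread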

Of course, you might also have R linked to an optimized BLAS. If that's the case, the parallelization most likely happens there during matrix algebra.
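
To check whether that is the case (since R 3.4.0, sessionInfo() reports the BLAS/LAPACK libraries R is linked against):

sessionInfo()  # look for the "BLAS" and "LAPACK" lines in the output
La_version()   # version of the LAPACK library in use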

Update:

@Dirk Eddelbuettel just pointed out that packages RhpcBLASctl and OpenMPController allow controlling the number of cores used by the BLAS or OpenMP.
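
For OpenMP-based parallelism, a minimal sketch (assuming OpenMPController is installed; the RhpcBLASctl equivalent for the BLAS is shown in the edit below):

library(OpenMPController)
omp_set_num_threads(4)  # cap OpenMP-parallelized compiled code at four threads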

Edit by kartoffelsalat:

The following worked for the issue in the question under Ubuntu 16.04. It did not work under macOS (neither did the package OpenMPController).

library(RhpcBLASctl)
blas_set_num_threads(3)  # limit the BLAS to three threads
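
To verify that the setting took effect, the same package provides query functions:

blas_get_num_procs()   # number of threads the BLAS is set to use
omp_get_max_threads()  # current OpenMP thread limit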

Upvotes: 3
