PaulG

Reputation: 286

How to parallelize an xgboost fit?

I am trying to fit many xgboost models with different parameters (e.g. for parameter tuning), and I need to run them in parallel to reduce the total run time. However, when the %dopar% loop runs I get the following error: Error in unserialize(socklist[[n]]) : error reading from connection.

Below is a reproducible example. The problem seems specific to xgboost, since any other calculation involving the global variables works within the %dopar% loop. Could someone point out what is missing or wrong with this approach?

#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)

#### Data Sim
n = 1000
X = cbind(runif(n,10,20), runif(n,0,10))
y = 10 + 2*X[,1] + 3*X[,2] + rnorm(n,0,1)

#### Init XGB
train = xgb.DMatrix(data  = X[-((n-10):n),], label = y[-((n-10):n)])
test  = xgb.DMatrix(data  = X[(n-10):n,],    label = y[(n-10):n]) 
watchlist = list(train = train, test = test)

#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)

clusterEvalQ(cl, {
  library(xgboost)
})

pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
  xgb.train(data = train, watchlist = watchlist, max_depth=i, nrounds = 1000, early_stopping_rounds = 10)$best_score
 # if xgb.train is replaced with anything else, e.g. 1+y, it works
} 

stopCluster(cl) 

Upvotes: 1

Views: 2464

Answers (1)

Chris

Reputation: 3986

As noted in the comments by HenrikB, xgb.DMatrix objects can't be used in this kind of parallelization: a DMatrix is only a handle to memory owned by xgboost's native code, so it doesn't survive the serialization that foreach uses to ship variables to the workers (a short sketch of this follows the worked example below). To get around this we can create the objects inside of foreach:

#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)
#> Loading required package: iterators

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)

pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
    # BRING CREATION OF XGB MATRIX INSIDE OF foreach
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
    
    watchlist = list(dtrain = dtrain, dtest = dtest)
    
    param <- list(max_depth = i, eta = 0.01, verbose = 0,
                  objective = "binary:logistic", eval_metric = "auc")
    bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
    bst$best_score
}

stopCluster(cl) 
pred
#> [[1]]
#> dtest-auc 
#>  0.892138 
#> 
#> [[2]]
#> dtest-auc 
#>  0.987974 
#> 
#> [[3]]
#> dtest-auc 
#>  0.986255 
#> 
#> [[4]]
#> dtest-auc 
#>         1 
#>  ...
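
To make the failure mode concrete, here is a minimal sketch (not part of the original answer, and reusing the agaricus data that ships with xgboost) of why a DMatrix built in the master session is useless on a worker:

library(xgboost)
data(agaricus.train, package = 'xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
typeof(dtrain)  # typically "externalptr": a handle into memory owned by xgboost's C++ code

# foreach/PSOCK clusters move exported objects with serialize()/unserialize().
# External pointers do not survive that round trip, so the worker receives an
# invalid handle; calling xgb.train() on it then fails or kills the worker,
# which is consistent with the "error reading from connection" in the question.
copy <- unserialize(serialize(dtrain, NULL))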

Benchmarking:

Since xgb.train is already parallelized internally (via its nthread parameter), it might be interesting to see the difference in speed between giving the threads to xgboost and using them to run the tuning rounds in parallel.

To do this I wrapped everything in a function and benchmarked the different combinations:

tune_par <- function(xgbthread, doparthread) {
  
  data(agaricus.train, package='xgboost')
  data(agaricus.test, package='xgboost')
  
  #### Init parallel & run
  cl = parallel::makeCluster(doparthread, setup_strategy = "sequential")
  doParallel::registerDoParallel(cl)
  
  clusterEvalQ(cl, {
    data(agaricus.train, package='xgboost')
    data(agaricus.test, package='xgboost')
  })

  pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
    
    watchlist = list(dtrain = dtrain, dtest = dtest)
    
    param <- list(max_depth = i, eta = 0.01, verbose = 0, nthread = xgbthread,
                  objective = "binary:logistic", eval_metric = "auc")
    bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
    bst$best_score
  } 
  
  stopCluster(cl) 
  
  pred
  
}

In my testing, evaluation was faster when using more threads for xgboost and fewer for the parallel running of tuning rounds. What works best probably depends on system specs and the amount of data.

# 16 logical cores split between xgb threads and threads in dopar cluster:
microbenchmark::microbenchmark(
  xgb16par1 = tune_par(xgbthread = 16, doparthread = 1),
  xgb8par2 = tune_par(xgbthread = 8, doparthread = 2),
  xgb4par4 = tune_par(xgbthread = 4, doparthread = 4),
  xgb2par8 = tune_par(xgbthread = 2, doparthread = 8),
  xgb1par16 = tune_par(xgbthread = 1, doparthread = 16),
  times = 5
)
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval  cld
#>  xgb16par1 2.295529 2.431110 2.500170 2.519277 2.527914 2.727021     5 a   
#>   xgb8par2 2.301189 2.308377 2.407767 2.363422 2.465446 2.600402     5 a   
#>   xgb4par4 2.632711 2.778304 2.875816 2.825471 2.849003 3.293593     5  b  
#>   xgb2par8 4.508485 4.682284 4.752776 4.810461 4.822566 4.940085     5   c 
#>  xgb1par16 8.493378 8.550609 8.679931 8.768008 8.779718 8.807943     5    d
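
Given these timings, on this machine the best split simply gives every thread to xgboost, in which case the dopar cluster adds nothing and a plain sequential loop does the same job. Below is a minimal sketch of that configuration (not part of the original benchmark; it reuses the agaricus data and swaps foreach for lapply):

library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')

pred <- lapply(1:10, function(i) {
  # build the DMatrix objects locally, exactly as in the answer above
  dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
  dtest  <- xgb.DMatrix(agaricus.test$data,  label = agaricus.test$label)

  # hand every logical core to xgboost and run the tuning rounds sequentially
  param <- list(max_depth = i, eta = 0.01, nthread = parallel::detectCores(),
                objective = "binary:logistic", eval_metric = "auc")
  xgb.train(param, dtrain, nrounds = 100,
            watchlist = list(dtest = dtest),
            early_stopping_rounds = 10, verbose = 0)$best_score
})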

Upvotes: 2
