Hong Ooi

Reputation: 57696

Issues with running xgboost in multiple R processes

Are there any known issues with running xgboost in multiple R processes simultaneously?

The background is that I'm trying to do a simple grid search for the best hyperparameters. Since I have several cores on this machine, I thought I'd run the models in parallel. However, xgb.DMatrix dies with an error:

> cl <- makeCluster(4)
> clusterEvalQ(cl, {
+     load("processed_data.rdata")
+     library(xgboost)
+     traindata <- xgb.DMatrix(as.matrix(train_df), label=train_y)
+     testdata <- xgb.DMatrix(as.matrix(test_df), label=test_y)
+ })
[[1]]
Error in dim.xgb.DMatrix(x) :
  [12:59:39] amalgamation/../src/c_api/c_api.cc:355: DMatrix/Booster has not been intialized or has already been disposed.

It looks like trying to load the xgboost code into multiple processes is not allowed. Is this correct? If so, what is the best way to conduct a parallelised grid search?

My session info:

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] purrr_0.3.4     xgboost_1.2.0.1 tidyr_1.1.2     dplyr_1.0.2

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    tidyselect_1.1.0  lattice_0.20-41   R6_2.5.0
 [5] rlang_0.4.9       fansi_0.4.1       tools_4.0.3       grid_4.0.3
 [9] data.table_1.13.2 utf8_1.1.4        cli_2.2.0         ellipsis_0.3.1
[13] assertthat_0.2.1  tibble_3.0.4      lifecycle_0.2.0   crayon_1.3.4     
[17] Matrix_1.2-18     vctrs_0.3.5       glue_1.4.2        stringi_1.5.3    
[21] compiler_4.0.3    pillar_1.4.7      generics_0.1.0    pkgconfig_2.0.3

Upvotes: 1

Views: 298

Answers (2)

Hong Ooi

Reputation: 57696

Solved it. By the usual R convention, clusterEvalQ returns the value of the last evaluated expression to the head node. It does this by serialising the value on the worker, sending it over the connection between worker and head, and deserialising it again on the head node. However, an xgb.DMatrix cannot be serialised like a regular R object, since it contains external pointers; trying to deserialise it causes the error.
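To see the return-value behaviour in isolation, here is a minimal sketch that uses plain integer vectors instead of xgb.DMatrix objects. The variable names and the 2-worker cluster are illustrative assumptions, not part of the original code. Note that an assignment still has a value (the assigned object), so a block ending in an assignment sends that object back to the head node.

```r
library(parallel)

cl <- makeCluster(2)

# The value of an assignment is the assigned object, so each worker
# serialises x and sends it back to the head node.
res_assign <- clusterEvalQ(cl, {
    x <- 1:3
})

# Ending the block with NULL means only NULL is sent back;
# x still exists in each worker's global environment.
res_null <- clusterEvalQ(cl, {
    x <- 1:3
    NULL
})

stopCluster(cl)

str(res_assign)
str(res_null)
```

With an ordinary vector the round trip is harmless, which is why the problem only shows up for objects that wrap external pointers.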

The solution is for the expression passed to clusterEvalQ to have anything other than an xgb.DMatrix call as its last line. NULL will do:

> clusterEvalQ(cl, {
+     load(file.path(local_dir, "processed_data.rdata"))
+     library(xgboost)
+     traindata <- xgb.DMatrix(as.matrix(train_df), label=train_y)
+     testdata <- xgb.DMatrix(as.matrix(test_df), label=test_y)
+     NULL
+ })
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL
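For the grid search itself, one possible approach (a sketch under stated assumptions, not necessarily the best way) is to keep the xgb.DMatrix objects on the workers and send back only plain numeric scores. In the example below, the random data is a stand-in for the contents of processed_data.rdata, and the grid values, the reg:squarederror objective, the 20 boosting rounds, and the helper fit_one are all hypothetical choices made for illustration.

```r
library(parallel)

cl <- makeCluster(2)

# Build the DMatrix objects on each worker. The trailing NULL keeps them
# from being sent back to the head node. The random data is a stand-in
# for train_df/train_y/test_df/test_y.
invisible(clusterEvalQ(cl, {
    library(xgboost)
    set.seed(1)
    x <- matrix(rnorm(200 * 5), ncol = 5)
    y <- x[, 1] + rnorm(200, sd = 0.1)
    traindata <- xgb.DMatrix(x[1:150, ], label = y[1:150])
    testdata <- xgb.DMatrix(x[151:200, ], label = y[151:200])
    NULL
}))

# Hypothetical hyperparameter grid
grid <- expand.grid(max_depth = c(2, 4), eta = c(0.1, 0.3))

# Runs on a worker: traindata and testdata are looked up in that worker's
# global environment, and only a single number is returned.
fit_one <- function(i, grid) {
    params <- list(objective = "reg:squarederror",
                   max_depth = grid$max_depth[i],
                   eta = grid$eta[i],
                   nthread = 1)
    model <- xgb.train(params = params, data = traindata,
                       nrounds = 20, verbose = 0)
    pred <- predict(model, testdata)
    sqrt(mean((pred - getinfo(testdata, "label"))^2))
}

grid$rmse <- unlist(parLapply(cl, seq_len(nrow(grid)), fit_one, grid = grid))

stopCluster(cl)

# Best combinations first
grid[order(grid$rmse), ]
```

Setting nthread = 1 is a design choice here: each worker is already a separate process, so letting every model also use multiple threads could oversubscribe the cores.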

Upvotes: 2

M.Viking

Reputation: 5398

You have an extra ) on the load() line. Otherwise, maybe it would work to split up the task into discrete steps?

library(xgboost)
load("processed_data.rdata")
cl <- makeCluster(4)
clusterEvalQ(cl, {
  traindata <- xgb.DMatrix(as.matrix(train_df), label=train_y)
})
clusterEvalQ(cl, {
  testdata <- xgb.DMatrix(as.matrix(test_df), label=test_y)
})

Upvotes: 0
