Reputation: 57696
Are there any known issues with running xgboost in multiple R processes simultaneously?
The background is that I'm trying to do a simple grid search for the best hyperparameters. Since I have several cores on this machine, I thought I'd run the models in parallel. However, xgb.DMatrix dies with an error:
> cl <- makeCluster(4)
> clusterEvalQ(cl, {
+ load("processed_data.rdata")
+ library(xgboost)
+ traindata <- xgb.DMatrix(as.matrix(train_df), label=train_y)
+ testdata <- xgb.DMatrix(as.matrix(test_df), label=test_y)
+ })
[[1]]
Error in dim.xgb.DMatrix(x) :
[12:59:39] amalgamation/../src/c_api/c_api.cc:355: DMatrix/Booster has not been intialized or has already been disposed.
It looks like trying to load the xgboost code into multiple processes is not allowed. Is this correct? If so, what is the best way to conduct a parallelised grid search?
My session info:
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] purrr_0.3.4 xgboost_1.2.0.1 tidyr_1.1.2 dplyr_1.0.2
loaded via a namespace (and not attached):
[1] magrittr_2.0.1 tidyselect_1.1.0 lattice_0.20-41 R6_2.5.0
[5] rlang_0.4.9 fansi_0.4.1 tools_4.0.3 grid_4.0.3
[9] data.table_1.13.2 utf8_1.1.4 cli_2.2.0 ellipsis_0.3.1
[13] assertthat_0.2.1 tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4
[17] Matrix_1.2-18 vctrs_0.3.5 glue_1.4.2 stringi_1.5.3
[21] compiler_4.0.3 pillar_1.4.7 generics_0.1.0 pkgconfig_2.0.3
Upvotes: 1
Views: 298
Reputation: 57696
Solved it. What happens is that clusterEvalQ, per convention in R, uses the value of the last evaluated expression as the value returned to the head node. This is done by serialising the value into the connection between worker and head, and then deserialising it again. However, an xgb.DMatrix cannot be serialised like a regular R object, since it contains external pointers; the copy that arrives on the head node is invalid, and trying to use it (here, printing the result) causes the error.
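The same effect can be reproduced in a single process, without a cluster. The sketch below uses a tiny made-up matrix and round-trips an xgb.DMatrix through serialize() and unserialize(), which is roughly what the worker-to-head connection does. The data and variable names are only for illustration, and the exact behaviour may differ between xgboost versions.

```r
library(xgboost)

# Tiny made-up data set, purely for illustration.
x <- matrix(rnorm(20), nrow = 5)
y <- c(0, 1, 0, 1, 0)
dm <- xgb.DMatrix(x, label = y)

# Round-trip through R's serialisation, much as the cluster connection does.
dm_copy <- unserialize(serialize(dm, NULL))

# The original handle is still usable.
orig_dim <- dim(dm)

# The copy's external pointer did not survive the round trip, so using it
# is expected to fail; try() captures the error instead of stopping.
copy_result <- try(dim(dm_copy), silent = TRUE)
copy_failed <- inherits(copy_result, "try-error")
```

If copy_failed is TRUE, the object on the receiving side is unusable even though the original is fine, which matches the error seen on the head node.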
The solution is for the expression passed to clusterEvalQ to have anything other than an xgb.DMatrix call as its last line. NULL will do:
> clusterEvalQ(cl, {
+ load(file.path(local_dir, "processed_data.rdata"))
+ library(xgboost)
+ traindata <- xgb.DMatrix(as.matrix(train_df), label=train_y)
+ testdata <- xgb.DMatrix(as.matrix(test_df), label=test_y)
+ NULL
+ })
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
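With the workers set up this way, the grid search itself can follow the same rule: keep every xgb.DMatrix and fitted model on the worker, and send back only plain R values. The following is a rough sketch under several assumptions that are not from the original post: the data are synthetic (in practice the load() call from above would go in their place), the task is regression with the "reg:squarederror" objective, and the grid values, nrounds and the RMSE metric are arbitrary placeholders.

```r
library(parallel)

cl <- makeCluster(2)

# One-off setup on each worker. Ends with NULL so that no xgb.DMatrix
# is sent back to the head node.
clusterEvalQ(cl, {
  library(xgboost)
  set.seed(1)
  # Synthetic stand-in for load("processed_data.rdata").
  train_df <- data.frame(a = rnorm(200), b = rnorm(200))
  train_y <- train_df$a + 0.5 * train_df$b + rnorm(200, sd = 0.1)
  test_df <- data.frame(a = rnorm(50), b = rnorm(50))
  test_y <- test_df$a + 0.5 * test_df$b + rnorm(50, sd = 0.1)
  traindata <- xgb.DMatrix(as.matrix(train_df), label = train_y)
  testdata <- xgb.DMatrix(as.matrix(test_df), label = test_y)
  NULL
})

# Hypothetical hyperparameter grid, one list element per combination.
grid <- expand.grid(max_depth = c(2, 4), eta = c(0.1, 0.3))
grid_rows <- split(grid, seq_len(nrow(grid)))

# Defined at top level, so traindata, testdata and test_y are looked up
# in each worker's global environment when the function runs there.
fit_one <- function(p) {
  fit <- xgb.train(
    params = list(objective = "reg:squarederror",
                  max_depth = p$max_depth, eta = p$eta,
                  nthread = 1),  # one thread per worker to avoid oversubscription
    data = traindata,
    nrounds = 20
  )
  pred <- predict(fit, testdata)
  # Return a plain data frame, never the model or the DMatrix.
  data.frame(max_depth = p$max_depth, eta = p$eta,
             rmse = sqrt(mean((pred - test_y)^2)))
}

results <- parLapply(cl, grid_rows, fit_one)
stopCluster(cl)

results_df <- do.call(rbind, results)
```

Setting nthread to 1 is a design choice here: each worker is already a separate process, so letting every model use all cores would likely make the workers compete with each other.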
Upvotes: 2
Reputation: 5398
Check that the parentheses on the load() line are balanced. Otherwise, maybe it would work to split up the task into discrete steps?
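library(parallel)
cl <- makeCluster(4)
# Load the package and the data on each worker, not just on the head node
clusterEvalQ(cl, {
library(xgboost)
load("processed_data.rdata")
NULL
})
# End each step with NULL so that no xgb.DMatrix is returned to the head node
clusterEvalQ(cl, {
traindata <- xgb.DMatrix(as.matrix(train_df), label=train_y)
NULL
})
clusterEvalQ(cl, {
testdata <- xgb.DMatrix(as.matrix(test_df), label=test_y)
NULL
})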
Upvotes: 0