Reputation: 4482
I am trying to run a grid search for two algorithms (random forest and gbm) on different parts of a data set, using h2o. My code looks like this:
for (...)
{
  # read data
  # setup h2o cluster
  h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)
  gbm.grid <- h2o.grid("gbm", grid_id = "gbm.grid",
                       x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary], y = segment_binary,
                       seed = 42, distribution = "bernoulli",
                       training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                       hyper_params = hyper_params, search_criteria = search_criteria)
  # shutdown h2o
  h2o.shutdown(prompt = FALSE)
  # setup h2o cluster
  h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)
  rf.grid <- h2o.grid("randomForest", grid_id = "rf.grid",
                      x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary], y = segment_binary,
                      seed = 42, distribution = "bernoulli",
                      training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                      hyper_params = hyper_params, search_criteria = search_criteria)
  # shutdown h2o
  h2o.shutdown(prompt = FALSE)
}
The problem is that if I run the for loop in one go, I get the error
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix, :
Unexpected CURL error: Failed to connect to localhost port 54321: Connection refused
P.S.: I am using the lines
# shutdown h2o
h2o.shutdown(prompt = FALSE)
# setup h2o cluster
h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores()-1)
to "reset" h2o so that I do not run out of memory.
I also read "R H2O - Memory management", but it is not clear to me how it works.
UPDATE
Following Matteusz's comment, I now call h2o.init() outside the for loop and use h2o.removeAll() inside it. My code now looks like this:
h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)
for (...)
{
  # read data
  gbm.grid <- h2o.grid("gbm", grid_id = "gbm.grid",
                       x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary], y = segment_binary,
                       seed = 42, distribution = "bernoulli",
                       training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                       hyper_params = hyper_params, search_criteria = search_criteria)
  h2o.removeAll()
  rf.grid <- h2o.grid("randomForest", grid_id = "rf.grid",
                      x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary], y = segment_binary,
                      seed = 42, distribution = "bernoulli",
                      training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                      hyper_params = hyper_params, search_criteria = search_criteria)
  h2o.removeAll()
}
It seems to work, but now I get this error (?) in the grid optimization for random forest.
Any ideas what this might be?
Upvotes: 1
Views: 1560
Reputation: 8819
The cause of the error is that you are not changing the grid_id parameter in your loop. My recommendation is to let H2O auto-generate a grid id by leaving it unspecified/NULL. You can also create different grid ids (one for each dataset) manually, but it's not required.
You can only add new models to an existing grid (by re-using the same grid id) when you use the same training set. When you put a grid search in a for loop over different datasets and keep the same grid id, it will throw an error because you are trying to append models trained on different datasets to the same grid.
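As a rough illustration of the second option, here is a minimal sketch that derives a distinct grid id from the loop index. The names make_grid_id, run_grids, frames, gbm_params and rf_params are hypothetical and not from the question; frames is assumed to be a list whose elements each hold a train and a valid H2OFrame, and h2o.init() is assumed to have been called already. Omitting grid_id altogether would let H2O generate the ids instead.

```r
# Hypothetical helper: one distinct grid id per algorithm and dataset index.
make_grid_id <- function(algo, i) paste0(algo, "_grid_", i)

# Sketch of the loop. Assumes h2o.init() was already called and that
# `frames` is a list of lists with `train` and `valid` H2OFrames.
run_grids <- function(frames, y, gbm_params, rf_params, search_criteria) {
  grids <- vector("list", length(frames))
  for (i in seq_along(frames)) {
    train <- frames[[i]]$train
    valid <- frames[[i]]$valid
    x <- setdiff(names(train), y)  # every column except the response

    # A new grid id per dataset, so models trained on different data
    # are never appended to the same grid.
    gbm_grid <- h2o::h2o.grid("gbm", grid_id = make_grid_id("gbm", i),
                              x = x, y = y, seed = 42,
                              training_frame = train, validation_frame = valid,
                              hyper_params = gbm_params,
                              search_criteria = search_criteria)
    rf_grid <- h2o::h2o.grid("randomForest", grid_id = make_grid_id("rf", i),
                             x = x, y = y, seed = 42,
                             training_frame = train, validation_frame = valid,
                             hyper_params = rf_params,
                             search_criteria = search_criteria)
    grids[[i]] <- list(gbm = gbm_grid, rf = rf_grid)
  }
  grids
}
```

For example, make_grid_id("gbm", 2) gives "gbm_grid_2", so each pass through the loop writes to its own grid.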
Upvotes: 1
Reputation: 15141
Starting up h2o twice in every iteration seems quite wasteful. If you just want to free up the memory, you can use h2o.removeAll() instead.
As for the cause, h2o.shutdown() (like any H2O shutdown) is not a synchronous operation, and some cleanup can still occur after the function returns (for example, handling of outstanding requests). You can use h2o.clusterIsUp() to check whether the cluster is actually down before starting it again with h2o.init().
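If you do want to restart the cluster, the check could be sketched roughly as follows. The helper names wait_until and restart_h2o, and the timeout and polling interval, are assumptions for illustration and not part of the h2o package. The sketch polls h2o.clusterIsUp() after the shutdown and only calls h2o.init() once the cluster no longer responds.

```r
# Hypothetical polling helper: call `predicate` until it returns TRUE or
# `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(predicate, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(predicate())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Shut the cluster down, wait until it no longer responds, then start it
# again. Arguments in `...` are passed on to h2o.init().
restart_h2o <- function(..., timeout = 30) {
  h2o::h2o.shutdown(prompt = FALSE)
  # Treat an error while probing the cluster as "cluster is down".
  is_down <- function() {
    !isTRUE(tryCatch(h2o::h2o.clusterIsUp(), error = function(e) FALSE))
  }
  if (!wait_until(is_down, timeout = timeout)) {
    stop("H2O cluster did not shut down within the timeout")
  }
  h2o::h2o.init(...)
}
```

Inside the loop this would be called as restart_h2o(ip = "localhost", port = 54321, nthreads = parallel::detectCores() - 1). Keeping one cluster alive and calling h2o.removeAll() is still the cheaper option.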
Upvotes: 3