quant

Reputation: 4482

R h2o connection (memory) issue

I am trying to run a grid-search optimization for two algorithms (random forest and GBM) on different parts of a data set, using H2O. My code looks like

for (...)
{
    # read data

    # set up the h2o cluster
    h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)

    gbm.grid <- h2o.grid("gbm", grid_id = "gbm.grid",
                         x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary],
                         y = segment_binary,
                         seed = 42, distribution = "bernoulli",
                         training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                         hyper_params = hyper_params, search_criteria = search_criteria)

    # shut down h2o
    h2o.shutdown(prompt = FALSE)

    # set up the h2o cluster again
    h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)

    rf.grid <- h2o.grid("randomForest", grid_id = "rf.grid",
                        x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary],
                        y = segment_binary,
                        seed = 42, distribution = "bernoulli",
                        training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                        hyper_params = hyper_params, search_criteria = search_criteria)

    h2o.shutdown(prompt = FALSE)
}

The problem is that if I run the for loop in one go, I get the error

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix,  : 
  Unexpected CURL error: Failed to connect to localhost port 54321: Connection refused

P.S.: I am using the lines

 # shutdown h2o
h2o.shutdown(prompt = FALSE)

# setup h2o cluster
h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores()-1)

so that I "reset" H2O and do not run out of memory.

I also read R H2O - Memory management but it is not clear to me how it works.

UPDATE

After following Mateusz's comment, I now call h2o.init() once outside the for loop and use h2o.removeAll() inside it. My code now looks like this:

h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)

for (...)
{
    # read data

    gbm.grid <- h2o.grid("gbm", grid_id = "gbm.grid",
                         x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary],
                         y = segment_binary,
                         seed = 42, distribution = "bernoulli",
                         training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                         hyper_params = hyper_params, search_criteria = search_criteria)

    h2o.removeAll()

    rf.grid <- h2o.grid("randomForest", grid_id = "rf.grid",
                        x = names(td.train.h2o)[!names(td.train.h2o) %like% segment_binary],
                        y = segment_binary,
                        seed = 42, distribution = "bernoulli",
                        training_frame = td.train.h2o, validation_frame = td.train.hyper.h2o,
                        hyper_params = hyper_params, search_criteria = search_criteria)

    h2o.removeAll()
}

It seems to work, but now I get this error (?) during the grid optimization for the random forest:

[screenshot of the error message; image not reproduced]

Any ideas what this might be?

Upvotes: 1

Views: 1560

Answers (2)

Erin LeDell

Reputation: 8819

The cause of the error is that you are not changing the grid_id parameter in your loop. My recommendation is to let H2O auto-generate a grid id by leaving it unspecified/NULL. You can also create different grid ids (one for each dataset) manually, but it's not required.

You can only add new models to an existing grid (by re-using the same grid id) when you use the same training set. When you put a grid search in a for loop over different datasets and keep the same grid id, it will throw an error because you are trying to append models trained on different datasets to the same grid.
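A minimal sketch of that fix (the loop variable `i` and the bound `n_parts` are placeholders for your own iteration; the other arguments stay as in your question):

```r
for (i in 1:n_parts) {
    # read data for part i ...

    # Either give each grid a unique id per dataset:
    gbm.grid <- h2o.grid("gbm",
                         grid_id = paste0("gbm.grid.", i),  # unique per iteration
                         x = predictors, y = segment_binary,
                         seed = 42, distribution = "bernoulli",
                         training_frame = td.train.h2o,
                         validation_frame = td.train.hyper.h2o,
                         hyper_params = hyper_params,
                         search_criteria = search_criteria)

    # ...or simply omit grid_id entirely and let H2O auto-generate one.
}
```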

Upvotes: 1

Mateusz Dymczyk

Reputation: 15141

This seems quite wasteful, starting up h2o twice every iteration. If you just want to free up the memory you can use h2o.removeAll() instead.

As for the cause: h2o.shutdown() (any H2O shutdown) is not a synchronous operation, and some cleanup can still occur after the function returns (for example, handling of outstanding requests). You can check with h2o.clusterIsUp() whether the cluster is actually down before starting it again with h2o.init().
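If you do need to restart the cluster, one way to avoid the "Connection refused" race is to poll until the shutdown has actually completed. A sketch (the one-second polling interval is an arbitrary choice):

```r
h2o.shutdown(prompt = FALSE)

# Shutdown is asynchronous: wait until the cluster is really down
# before re-initializing it on the same port.
while (h2o.clusterIsUp()) {
    Sys.sleep(1)
}

h2o <- h2o.init(ip = "localhost", port = 54321, nthreads = detectCores() - 1)
```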

Upvotes: 3
