R H2O grid search: how to train top model on new data?

Question

After running a hyperparameter search and extracting the best model from the grid, is it possible to use the model object to train on a new data set? The only way I see now is to manually create a call to a train function (e.g. h2o.gbm()) with the parameters from the best model, but this is very cumbersome.

Sixiang.Hu · Accepted Answer

The checkpoint parameter may meet your needs: it continues training from an existing model instead of starting from scratch.

This functionality is available for GBM, random forest, and deep learning in the h2o package.

The example code below is copied from: http://s3.amazonaws.com/h2o-release/h2o/master/3689/docs-website/h2o-docs/data-science/algo-params/checkpoint.html

library(h2o)
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"

# split into train and validation sets
cars.split <- h2o.splitFrame(data = cars,ratios = 0.8, seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]

# build a GBM with 1 tree (ntrees = 1) for the first model:
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                    validation_frame = valid, ntrees = 1, seed = 1234)

# print the auc for the validation data
print(h2o.auc(cars_gbm, valid = TRUE))

# re-start the training process on a saved GBM model using the `checkpoint` argument:
# the checkpoint argument requires the model id of the model on which you wish to continue building
# get the model's id from the "cars_gbm" model using `cars_gbm@model_id`
# the first model has 1 tree, let's continue building the GBM with 49 more trees, so set ntrees = 50

# to see how many trees the original model built you can look at the `ntrees` attribute
print(paste("Number of trees built for cars_gbm model:", cars_gbm@allparameters$ntrees))

# build and train model with 49 additional trees for a total of 50 trees:
cars_gbm_continued <- h2o.gbm(x = predictors, y = response, training_frame = train,
                    validation_frame = valid, checkpoint = cars_gbm@model_id, ntrees = 50, seed = 1234)

# print the auc for the validation data
print(h2o.auc(cars_gbm_continued, valid = TRUE))

# you can also use checkpointing to pass in a new dataset (see options above for parameters you cannot change)
# simply change out the training and validation frames with your new dataset
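
The last two comments above can be sketched as a small wrapper. This is only a sketch under assumptions: `continue_on_new_data` and its argument names are hypothetical (not part of h2o), the new frame is assumed to have the same predictor and response columns as the original data, and `total_trees` must be larger than the number of trees the checkpointed model already has.

```r
# Hypothetical wrapper (not part of the h2o package): continue a GBM from a
# checkpoint, but train the additional trees on a different H2OFrame.
# Assumes `new_train` has the same columns as the data the model was built on.
continue_on_new_data <- function(prev_model, new_train, predictors, response,
                                 total_trees, ...) {
  h2o::h2o.gbm(x = predictors, y = response,
               training_frame = new_train,
               checkpoint = prev_model@model_id,  # model to continue from
               ntrees = total_trees,              # total, not additional, trees
               ...)                               # e.g. validation_frame, seed
}
```

With the objects from the example, a call might look like `continue_on_new_data(cars_gbm_continued, new_cars, predictors, response, total_trees = 100, seed = 1234)`, where `new_cars` is a placeholder for your own frame.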

Edit (based on @Edward's comment below):

h2o.grid will return a series of models, and you can get a handle to the best one. All the parameters are saved in the model handle, so you can apply them to a new model.

# my_grid is the object returned by your earlier h2o.grid() call (the name is illustrative)
grid <- h2o.getGrid(my_grid@grid_id, sort_by = "auc", decreasing = TRUE)
model.h2o <- h2o.getModel(grid@model_ids[[1]])

model.h2o@allparameters includes all parameters used, and you can use those to create a new model on new data.
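
One way to avoid retyping the parameters by hand is to pass the stored list to the training function with do.call. The sketch below is not an official h2o feature: `reusable_params` and `retrain_on_new_data` are hypothetical helper names, and it assumes the stored parameter names match the arguments of the training function (anything that does not match is dropped). Some parameter combinations may still need manual adjustment.

```r
# Hypothetical helper (not part of the h2o package): keep only the stored
# parameters that the training function accepts, and drop the ones tied to
# the original run (model id, data frames, checkpoint).
reusable_params <- function(params, train_fun,
                            drop = c("model_id", "training_frame",
                                     "validation_frame", "checkpoint")) {
  keep <- intersect(names(params), names(formals(train_fun)))
  keep <- setdiff(keep, drop)
  params[keep]
}

# Hypothetical wrapper: retrain with the best model's settings on a new frame.
# `best_model` is an H2O model such as model.h2o above, `new_train` is an
# H2OFrame, and `train_fun` is the matching trainer, e.g. h2o.gbm.
retrain_on_new_data <- function(best_model, new_train, train_fun) {
  args <- reusable_params(best_model@allparameters, train_fun)
  args$training_frame <- new_train
  do.call(train_fun, args)
}
```

For a GBM grid, a call might look like `new_model <- retrain_on_new_data(model.h2o, new_data, h2o.gbm)`, where `new_data` is a placeholder for your own H2OFrame with the same columns as the original training data.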
