Gavin

Reputation: 1521

For grid search using H2O in Python, where should we set the seed?

I've seen a few places where you can set a seed when doing grid search to tune hyperparameters. For example, you can set a seed in the following 3 places:

  1. when initializing the estimator with H2OGradientBoostingEstimator,
  2. when defining the search_criteria, and
  3. when calling the train function on the defined grid.

Are these 3 redundant, so that we only need to set one of them, or does each of them play a different role?

thanks!

Upvotes: 0

Views: 1147

Answers (1)

Lauren

Reputation: 5778

There are two places where you can specify a seed when using the Python API.

1) The Estimator; let's take GBM as the example:

gbm = H2OGradientBoostingEstimator(nfolds=5, seed=1234)
gbm.train(x=features, y=response, training_frame=train)

Notice that I don't specify a seed within the train method. If you pass a seed argument to train, it will raise an error.

From the API docs you can see that no seed argument is accepted:

train(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, model_id=None, verbose=False)

From the documentation, here is the definition of an Estimator's seed:

This option specifies the random number generator (RNG) seed for algorithms that are dependent on randomization. When a seed is defined, the algorithm will behave deterministically. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
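The determinism described above is standard seeded-RNG behavior, not something H2O-specific. As a minimal stand-alone illustration in plain Python (using a hypothetical `noisy_sum` helper, not an H2O function): fixing the seed fixes the random stream, so repeated runs produce identical results.

```python
import random

# Hypothetical helper: a "randomized algorithm" whose output depends on an RNG.
def noisy_sum(data, seed):
    rng = random.Random(seed)  # dedicated RNG, seeded explicitly
    return sum(x + rng.random() for x in data)

a = noisy_sum([1, 2, 3], seed=1234)
b = noisy_sum([1, 2, 3], seed=1234)  # same seed -> same random stream
c = noisy_sum([1, 2, 3], seed=9999)  # different seed -> different stream

print(a == b)  # True: same seed gives identical, reproducible output
print(a == c)  # False: changing the seed changes the randomized result
```

The same principle applies to an H2O Estimator's seed: two models built with identical data, parameters, and seed behave deterministically, while changing the seed changes any randomization-dependent behavior.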

2) The search_criteria in H2OGridSearch. From the docs:

More about search_criteria: This is a dictionary of control parameters for smarter hyperparameter search. The dictionary can include values for: strategy, max_models, max_runtime_secs, stopping_metric, stopping_tolerance, stopping_rounds and seed. The default value for strategy, "Cartesian", covers the entire space of hyperparameter combinations. If you want to use Cartesian grid search, you can leave the search_criteria argument unspecified. Specify the "RandomDiscrete" strategy to perform a random search of all the combinations of your hyperparameters. RandomDiscrete should usually be combined with at least one early stopping criterion, max_models and/or max_runtime_secs.

While you can pass a seed parameter to the train method for grid search without anything breaking, the seed parameter there does nothing. If you want reproducible grid search runs, you need to specify the seed argument inside the search_criteria parameter, like so:

# build grid search with previously made GBM and hyperparameters
grid = H2OGridSearch(model=my_model, hyper_params=hyper_params,
                     search_criteria={'strategy': "RandomDiscrete", "max_runtime_secs": 10, "seed": 1234})

# train using the grid
grid.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

Upvotes: 1
