Cem

Reputation: 148

Replicated runs with h2o.gbm()

I need to have replicated runs that give different results with the same hyperparameters in h2o.gbm function.

Even though I've created a loop that runs each configuration twice and extracts the results with the h2o.performance function, I've just realized that each twin run produces exactly the same results.

What do you suggest to me for having different results by running two h2o.gbm models with the same hyperparameters?

Things that I've tried:

  1. Shutting down (h2o.shutdown) and restarting (h2o.init) the cluster with different nthreads values
  2. Changing and removing the seed argument inside h2o.gbm
  3. Deleting the score_tree_interval and stopping_rounds arguments

All these attempts failed; two runs with the same hyperparameters gave exactly the same results. Below is a sample hyperparameter configuration that I would like to produce different results when run twice.

h2o.gbm(x = x_col_names, y = y, 
        training_frame = train_h2o, 
        fold_column = "index_4seasons",
        ntrees = 1000, 
        max_depth = 5, 
        learn_rate = 0.1, 
        stopping_rounds = 5, 
        score_tree_interval = 10, 
        seed = 1)

Any help and comment would be appreciated.

Upvotes: 0

Views: 48

Answers (1)

Neema Mashayekhi

Reputation: 930

The seed value will change the results slightly. The example below, adapted from the docs, demonstrates that the MSE changes when only the seed differs.

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Import the prostate dataset into H2O:
train_h2o = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set the predictors and response; set the factors:
train_h2o["CAPSULE"] = train_h2o["CAPSULE"].asfactor()
x_col_names = ["ID","AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
y = "CAPSULE"

# Build and train first model:
pros_gbm1 = H2OGradientBoostingEstimator(
    nfolds = 5, ntrees = 1000, max_depth = 5, learn_rate = 0.1, 
    stopping_rounds = 5, score_tree_interval = 10, seed = 1)

pros_gbm1.train(x = x_col_names, y = y, 
                training_frame = train_h2o)

# Build and train the second model with only seed number changed:
pros_gbm2 = H2OGradientBoostingEstimator(
    nfolds = 5, ntrees = 1000, max_depth = 5, learn_rate = 0.1, 
    stopping_rounds = 5, score_tree_interval = 10, seed = 123456789)

pros_gbm2.train(x = x_col_names, y = y, 
                training_frame = train_h2o)

print('Model 1 MSE:', pros_gbm1.mse())
print('Model 2 MSE:', pros_gbm2.mse())

Output

Model 1 MSE: 0.020725291770552916
Model 2 MSE: 0.02189654172905499

If your dataset is giving reproducible results with different seeds and hardware settings, it may be that it is not large or complex enough for the models to behave stochastically. You can also try changing the folds in the fold_column to see if that has an effect.
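If the goal is replicated runs that differ from one another but remain reproducible as a set, one pattern (a sketch, not part of the code above; the helper name is hypothetical) is to derive a distinct seed for each replicate from a single base seed and pass it to each model:

```python
import random

def replicate_seeds(n_replicates, base_seed=1):
    """Derive n distinct seeds from one base seed.

    The same base_seed always yields the same list, so the
    whole experiment stays reproducible while each replicate
    gets its own model seed.
    """
    rng = random.Random(base_seed)
    return rng.sample(range(1, 10**9), n_replicates)

seeds = replicate_seeds(2)
# each replicate then gets its own seed, e.g.:
#   H2OGradientBoostingEstimator(..., seed=seeds[i])
# or in R:
#   h2o.gbm(..., seed = seeds[i])
```

Twin runs then diverge (different seeds), but rerunning the loop with the same base seed reproduces the same pair of results.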

Upvotes: 1
