Geir Inge

Reputation: 189

H2O GAM train: parameter "fold_column" not working

I cannot get the parameter "fold_column" to work with H2OGeneralizedAdditiveEstimator in Python.

I need to create the folds outside H2O, so I build them in a Pandas DataFrame and load the finished frame into an H2OFrame, which then contains a column "fold_number". Looping through the folds and training one model per fold works. But running GAM training with fold_column="fold_number" fails with "Not enough data to create 2 random cross-validation splits", even though I just trained those two models by hand. It still fails if I enlarge the data set considerably by adding modified copies of the original rows. The same setup works fine with H2OGeneralizedLinearEstimator.

Any tips on this, or is this a bug?

I am running python=3.6.13, h2o=3.32.1.3, pandas=0.25.3, numpy=1.19.5, sklearn=0.24.2. Java Version: openjdk version "14.0.2".

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
import h2o
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator

h2o.no_progress()
h2o.init()

np.random.seed(42)
boston = load_boston()
y = pd.Series(boston["target"], name="y")
X = pd.DataFrame(boston["data"], columns=boston["feature_names"])  # shape: (506, 13)
myweight = pd.Series(np.random.random_sample((len(y),)), name="myweight2")

predictors = ['CRIM', 'AGE']
gam_columns = ['CRIM']

params = {
    "family": "gaussian",
    "gam_columns": gam_columns,
    'bs': len(gam_columns) * [0],
}

fold = pd.Series(np.append(np.zeros(253), np.ones(253)), dtype=int, index=y.index, name="fold_number")
df0 = pd.concat([y, X, myweight, fold], axis=1)
df = h2o.H2OFrame(python_obj=df0)

# df["fold_number"] = df["fold_number"].asfactor()

for i in [0, 1]:
    mask = df["fold_number"] == i
    df_train = df[~mask, :]
    df_val = df[mask, :]

    model = H2OGeneralizedAdditiveEstimator(**params)
    model.train(
        x=predictors,
        y="y",
        weights_column="myweight2",
        training_frame=df_train,
    )

    print("Finished training for fold_number=", i, ", with validation-RMSE=", model.rmse(df_val))

print("\nStarting training with API option fold_column=")
model2 = H2OGeneralizedAdditiveEstimator(**params)
model2.train(
    x=predictors,
    y="y",
    weights_column="myweight2",
    training_frame=df,
    fold_column="fold_number"
)
print("Finished training with API option fold_column=")

The output I get is:

Checking whether there is an H2O instance running at http://localhost:54321 . connected.
--------------------------  ------------------------------------------------

Finished training for fold_number= 0 , with validation-RMSE= 7.33788975630292
Finished training for fold_number= 1 , with validation-RMSE= 7.912477133985602

Starting training with API option fold_column=
Traceback (most recent call last):
  File "/Users/g009655/tmp7/h2otest/test_gam_cv.py", line 55, in <module>
    fold_column="fold_number"
  File "/Users/g009655/Library/Caches/pypoetry/virtualenvs/h2otest-S7Xak4Mg-py3.6/lib/python3.6/site-packages/h2o/estimators/estimator_base.py", line 115, in train
    self._train(parms, verbose=verbose)
  File "/Users/g009655/Library/Caches/pypoetry/virtualenvs/h2otest-S7Xak4Mg-py3.6/lib/python3.6/site-packages/h2o/estimators/estimator_base.py", line 207, in _train
    job.poll(poll_updates=self._print_model_scoring_history if verbose else None)
  File "/Users/g009655/Library/Caches/pypoetry/virtualenvs/h2otest-S7Xak4Mg-py3.6/lib/python3.6/site-packages/h2o/job.py", line 80, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
OSError: Job with key $03017f00000132d4ffffffff$_af1219b23ff0642a316d9f092b214dc6 failed with an exception: water.exceptions.H2OIllegalArgumentException: 

Not enough data to create 2 random cross-validation splits. 
Either reduce nfolds, specify a larger dataset (or specify another random number seed, 
if applicable).

stacktrace: 
water.exceptions.H2OIllegalArgumentException: 

Not enough data to create 2 random cross-validation splits. 
Either reduce nfolds, specify a larger dataset (or specify another random number seed, 
if applicable).

    at hex.ModelBuilder.cv_makeWeights(ModelBuilder.java:726)
    at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:604)
    at hex.glm.GLM.computeCrossValidation(GLM.java:136)
    at hex.ModelBuilder$1.compute2(ModelBuilder.java:379)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1637)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Closing connection _sid_b4d8 at exit
H2O session _sid_b4d8 closed.

Process finished with exit code 1

Upvotes: 2

Views: 193

Answers (1)

Erin LeDell

Reputation: 8819

I was able to reproduce the error, and indeed it's not working (neither nfolds nor fold_column seems to work). We will fix this ASAP. Here's the Jira ticket: https://h2oai.atlassian.net/browse/PUBDEV-8163
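
Until a fix is released, the manual per-fold loop from the question can stand in for fold_column. Below is a minimal sketch of that loop as a reusable helper. The names manual_fold_cv and fit_and_score are hypothetical, and the model is a plain NumPy line fit on synthetic data rather than H2O, so the snippet runs without a cluster; it is an illustration of the workaround, not H2O's own cross-validation.

```python
import numpy as np


def manual_fold_cv(fold_ids, fit_and_score):
    """Leave-one-fold-out CV driven by a precomputed fold assignment.

    fit_and_score(train_mask, val_mask) is a user-supplied callback
    (hypothetical name) that trains on the rows where train_mask is True
    and returns the RMSE on the rows where val_mask is True.
    """
    fold_ids = np.asarray(fold_ids)
    scores = {}
    for fold in np.unique(fold_ids):
        val_mask = fold_ids == fold  # rows held out for this fold
        scores[int(fold)] = float(fit_and_score(~val_mask, val_mask))
    return scores


# Synthetic stand-in data (an assumption, not the Boston data set).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=506)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=506)

# Same fold layout as in the question: 253 rows per fold.
fold_number = np.append(np.zeros(253, dtype=int), np.ones(253, dtype=int))


def fit_and_score(train_mask, val_mask):
    # Stand-in model: straight-line least squares instead of a GAM.
    slope, intercept = np.polyfit(x[train_mask], y[train_mask], deg=1)
    residuals = y[val_mask] - (slope * x[val_mask] + intercept)
    return np.sqrt(np.mean(residuals ** 2))


cv_rmse = manual_fold_cv(fold_number, fit_and_score)
print("Per-fold validation RMSE:", cv_rmse)
print("Mean validation RMSE:", np.mean(list(cv_rmse.values())))
```

With H2O, the callback body would presumably be the training and scoring calls from the question's loop (slicing the H2OFrame by the fold mask, training the GAM on the remaining rows, and returning model.rmse(df_val)).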

Upvotes: 3
