Reputation: 189
I cannot get the parameter "fold_column" to work with H2OGeneralizedAdditiveEstimator in Python.
I need to create the folds outside H2O and then load the finished Pandas DataFrame into an H2OFrame. The H2OFrame has a column "fold_number". I can loop through the folds and train a model for each fold manually. But when I run GAM training with fold_column="fold_number", it fails with "Not enough data to create 2 random cross-validation splits". But I just trained those two models! It fails even if I enlarge the data set considerably by adding modified copies of the original rows. Everything works fine with H2OGeneralizedLinearEstimator.
Any tips on this, or is this a bug?
I am running python=3.6.13, h2o=3.32.1.3, pandas=0.25.3, numpy=1.19.5, sklearn=0.24.2. Java Version: openjdk version "14.0.2".
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
import h2o
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator
h2o.no_progress()
h2o.init()
np.random.seed(42)
boston = load_boston()
y = pd.Series(boston["target"], name="y")
X = pd.DataFrame(boston["data"], columns=boston["feature_names"]) # shape: (506, 13)
myweight = pd.Series(np.random.random_sample((len(y),)), name="myweight2")
predictors = ['CRIM', 'AGE']
gam_columns = ['CRIM']
params = {
    "family": "gaussian",
    "gam_columns": gam_columns,
    "bs": len(gam_columns) * [0],
}
fold = pd.Series(np.append(np.zeros(253), np.ones(253)), dtype=int, index=y.index, name="fold_number")
df0 = pd.concat([y, X, myweight, fold], axis=1)
df = h2o.H2OFrame(python_obj=df0)
# df["fold_number"] = df["fold_number"].asfactor()
for i in [0, 1]:
    mask = df["fold_number"] == i
    df_train = df[~mask, :]
    df_val = df[mask, :]
    model = H2OGeneralizedAdditiveEstimator(**params)
    model.train(
        x=predictors,
        y="y",
        weights_column="myweight2",
        training_frame=df_train,
    )
    print("Finished training for fold_number=", i, ", with validation-RMSE=", model.rmse(df_val))
print("\nStarting training with API option fold_column=")
model2 = H2OGeneralizedAdditiveEstimator(**params)
model2.train(
    x=predictors,
    y="y",
    weights_column="myweight2",
    training_frame=df,
    fold_column="fold_number"
)
print("Finished training with API option fold_column=")
The output I get is:
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
-------------------------- ------------------------------------------------
Finished training for fold_number= 0 , with validation-RMSE= 7.33788975630292
Finished training for fold_number= 1 , with validation-RMSE= 7.912477133985602
Starting training with API option fold_column=
Traceback (most recent call last):
  File "/Users/g009655/tmp7/h2otest/test_gam_cv.py", line 55, in <module>
    fold_column="fold_number"
  File "/Users/g009655/Library/Caches/pypoetry/virtualenvs/h2otest-S7Xak4Mg-py3.6/lib/python3.6/site-packages/h2o/estimators/estimator_base.py", line 115, in train
    self._train(parms, verbose=verbose)
  File "/Users/g009655/Library/Caches/pypoetry/virtualenvs/h2otest-S7Xak4Mg-py3.6/lib/python3.6/site-packages/h2o/estimators/estimator_base.py", line 207, in _train
    job.poll(poll_updates=self._print_model_scoring_history if verbose else None)
  File "/Users/g009655/Library/Caches/pypoetry/virtualenvs/h2otest-S7Xak4Mg-py3.6/lib/python3.6/site-packages/h2o/job.py", line 80, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
OSError: Job with key $03017f00000132d4ffffffff$_af1219b23ff0642a316d9f092b214dc6 failed with an exception: water.exceptions.H2OIllegalArgumentException:
Not enough data to create 2 random cross-validation splits.
Either reduce nfolds, specify a larger dataset (or specify another random number seed,
if applicable).
stacktrace:
water.exceptions.H2OIllegalArgumentException:
Not enough data to create 2 random cross-validation splits.
Either reduce nfolds, specify a larger dataset (or specify another random number seed,
if applicable).
    at hex.ModelBuilder.cv_makeWeights(ModelBuilder.java:726)
    at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:604)
    at hex.glm.GLM.computeCrossValidation(GLM.java:136)
    at hex.ModelBuilder$1.compute2(ModelBuilder.java:379)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1637)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Closing connection _sid_b4d8 at exit
H2O session _sid_b4d8 closed.
Process finished with exit code 1
Upvotes: 2
Views: 193
Reputation: 8819
I was able to reproduce the error, and indeed it's not working: neither nfolds nor fold_column seems to work with GAM. We will fix this ASAP. Here's the Jira ticket: https://h2oai.atlassian.net/browse/PUBDEV-8163
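Until a fix is available, one possible workaround is to run the cross-validation by hand, much like the loop in the question already does. Below is a minimal sketch of that idea, not an official H2O recipe. The names `manual_fold_cv` and `make_model` are hypothetical helpers introduced here for illustration. The sketch assumes the frame supports the same masking syntax used in the question (`frame[~mask, :]`) and that the model exposes `train(...)` and `model_performance(test_data=...)`, which scores explicitly on the held-out rows.

```python
def manual_fold_cv(frame, fold_column, fold_values, make_model, **train_kwargs):
    """Hold out each fold in turn; return {fold value: RMSE on the held-out rows}.

    `manual_fold_cv` and `make_model` are hypothetical names for this sketch.
    `make_model` is any zero-argument callable returning a fresh estimator.
    """
    rmse_by_fold = {}
    for fold in fold_values:
        mask = frame[fold_column] == fold   # rows that belong to the held-out fold
        model = make_model()                # fresh, untrained estimator per fold
        model.train(training_frame=frame[~mask, :], **train_kwargs)
        # Score explicitly on the rows that were not used for training.
        perf = model.model_performance(test_data=frame[mask, :])
        rmse_by_fold[fold] = perf.rmse()
    return rmse_by_fold


# Usage with the objects from the question (needs a running H2O cluster):
# scores = manual_fold_cv(
#     df, "fold_number", [0, 1],
#     lambda: H2OGeneralizedAdditiveEstimator(**params),
#     x=predictors, y="y", weights_column="myweight2",
# )
# print(scores, sum(scores.values()) / len(scores))
```

Passing a factory instead of a model instance keeps each fold's model independent. The plain average in the usage comment is only one way to summarize the per-fold scores and does not account for unequal fold sizes.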
Upvotes: 3