lampShadesDrifter

Reputation: 4139

H2O stacked ensemble with models using different inputs

Using H2O Flow, is there a way to create a stacked ensemble from individual models that do not take the same inputs but predict the same response labels?

For example, I am trying to predict miscoded healthcare claims (i.e. charges) and would like to train models for a stacked ensemble of the form:

model1(diagnosis1, diagnosis2, ..., diagnosis5) -> denied or paid (by insurer)
model2(procedure, procedure_detail1, ..., procedure_detail5) -> denied or paid 
model3(service_date, insurance_amount, insurer_id) -> (same)
model4(pat_age, pat_sex, ...) -> (same)
...

Is there a way to do this in H2O Flow? I can't tell how from what the Flow GUI presents for Stacked Ensemble. Is this even a sensible way to approach the problem, or is it confused in some way? (I am relatively new to machine learning.) Thanks.

Upvotes: 2

Views: 2226

Answers (2)

Erin LeDell

Reputation: 8819

Darren's response that you can't do this in H2O was correct until very recently. H2O has just removed the requirement that the base models be trained on the same set of inputs, since the Stacked Ensemble algorithm does not actually require it. The change is only available in the nightly releases off of master, though, so on the latest stable release you would still see an error like the following (in Flow, R, Python, etc.) if you tried to use models that don't use exactly the same columns:

Error: water.exceptions.H2OIllegalArgumentException: Base models are inconsistent: they use different column lists.  Found: [x6, x7, x4, x5, x2, x3, x1, x9, x8, x10, response] and: [x10, x16, x15, x18, x17, x12, x11, x14, x13, x19, x9, x8, x20, x21, x28, x27, x26, x25, x24, x23, x22, x6, x7, x4, x5, x2, x3, x1, response].  

The metalearning step in the Stacked Ensemble algorithm combines the outputs (predictions) of the base models, so which inputs each base model was trained on, or how many, does not matter to the metalearner. Currently, H2O still requires that the inputs all come from the same original training_frame, but you can use a different x for each base model if you like (the x argument specifies which columns of the training_frame to use in a model).
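To make the point about the metalearner concrete, here is a minimal base-R sketch of the general stacking idea. It is not H2O's implementation; the data is synthetic, and the column names (a, b, c, d), the two base formulas, and the choice of logistic regression for every model are all assumptions made for illustration.

```r
set.seed(1)
n <- 400

# Synthetic data (hypothetical): four predictors and a binary response
df <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$a - df$b + 0.5 * df$c + df$d))

# Each base model uses a different column subset of the same frame
base_formulas <- list(m1 = y ~ a + b, m2 = y ~ c + d)

# Level-one matrix: one column of out-of-fold predictions per base model
folds <- sample(rep(1:5, length.out = n))
level_one <- matrix(NA_real_, nrow = n, ncol = length(base_formulas),
                    dimnames = list(NULL, names(base_formulas)))
for (k in 1:5) {
  held_out <- folds == k
  for (m in names(base_formulas)) {
    fit <- glm(base_formulas[[m]], data = df[!held_out, ], family = binomial())
    level_one[held_out, m] <- predict(fit, newdata = df[held_out, ],
                                      type = "response")
  }
}

# The metalearner only ever sees the prediction columns and the response
meta_data <- data.frame(level_one, y = df$y)
metalearner <- glm(y ~ ., data = meta_data, family = binomial())

dim(level_one)  # 400 rows, 2 columns: one per base model
```

The level-one matrix has the same shape whether a base model used two columns or twenty, which is why differing inputs are not a problem for the metalearning step itself.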

In Flow, Stacked Ensemble looks for models that are all "compatible", in other words, trained on the same data frame. You then select from this list the ones you want to include in the ensemble. So, as long as you are using the latest development version of H2O, this is how to do what you want in Flow.

[Screenshot: selecting ensemble base models in H2O Flow]

Here's an R example of how to ensemble models that are trained on different subsets of the feature space:

library(h2o)
h2o.init()

# Import a sample binary outcome training set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])

# Train & Cross-validate a GBM using a subset of features
my_gbm <- h2o.gbm(x = x[1:10],
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  nfolds = 5,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)

# Train & Cross-validate a RF using a subset of features
my_rf <- h2o.randomForest(x = x[3:15],
                          y = y,
                          training_frame = train,
                          nfolds = 5,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)

# Train a stacked ensemble using the GBM and RF above
ensemble <- h2o.stackedEnsemble(y = y, training_frame = train,
                                base_models = list(my_gbm, my_rf))

# Check out ensemble performance
perf <- h2o.performance(ensemble, newdata = test)
h2o.auc(perf)

Upvotes: 2

Darren Cook

Reputation: 28928

A stacked ensemble won't do this, as it does require identical inputs to each model. But you can set up a looser kind of ensemble... and that can almost, but not quite, be done in Flow.

Basically, you would create your four models and then run predict on each of them. Each predict() call gives you a new H2O frame. You would then need to cbind (column-bind) those four predictions together, giving a new H2O frame with 4 binary columns (*). That frame would then be fed into a 5th model, which gives you a combined result.

*: This is the bit I don't think you can do in Flow. You would need to export the data, combine it in another application, then bring it back in.
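Outside Flow, for example from R, the steps above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses two base models instead of four, the Higgs sample files from the other answer as stand-in data, the class-1 probability column (p1) rather than a binary label, and a GLM as the 5th model. It also trains the combiner on cross-validated holdout predictions instead of plain predict() output on the training data, which is a substitution made here to avoid fitting the combiner on overly optimistic in-sample predictions.

```r
library(h2o)
h2o.init()

# Stand-in data (same sample files as in the other answer)
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])

# Two base models on different column subsets
m1 <- h2o.gbm(x = x[1:10], y = y, training_frame = train, nfolds = 5,
              keep_cross_validation_predictions = TRUE, seed = 1)
m2 <- h2o.randomForest(x = x[11:20], y = y, training_frame = train, nfolds = 5,
                       keep_cross_validation_predictions = TRUE, seed = 1)

# Holdout P(class 1) from each base model, renamed to avoid name clashes
p1 <- h2o.cross_validation_holdout_predictions(m1)[, "p1"]
p2 <- h2o.cross_validation_holdout_predictions(m2)[, "p1"]
names(p1) <- "m1_p1"
names(p2) <- "m2_p1"

# Column-bind the predictions with the response and train the combining model
level_one <- h2o.cbind(p1, p2, train[, y])
combiner <- h2o.glm(x = c("m1_p1", "m2_p1"), y = y,
                    training_frame = level_one, family = "binomial")

# Scoring new data: predict with each base model, cbind, then predict again
t1 <- h2o.predict(m1, test)[, "p1"]
t2 <- h2o.predict(m2, test)[, "p1"]
names(t1) <- "m1_p1"
names(t2) <- "m2_p1"
combined <- h2o.predict(combiner, h2o.cbind(t1, t2))
```

The variable names (m1, m2, level_one, combiner) are hypothetical; with four base models the pattern is the same, just with four prediction columns.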

A better approach would be to build a single model using all the inputs together. This would be both simpler and more accurate (as, e.g., interactions between insurance_amount and pat_age could be discovered). But the (potentially major) downside is that you can no longer explain the model as four sets of yes/no, i.e. it becomes more black-box-like.

Upvotes: 1
