Joshua Olney

Reputation: 53

h2o.performance predictions differ from h2o.predict?

Apologies if this has been answered elsewhere, but I couldn't find anything.

I'm using h2o (latest release) in R. I've created a random forest model using h2o.grid (for parameter tuning) and called this 'my_rf'

My steps are as follows:

  1. train a grid of randomForest models with parameter tuning & cross-validation (nfolds = 5)
  2. get the sorted grid of models (by AUC) and set my_rf = best model
  3. use h2o.performance(my_rf, test) to assess AUC, accuracy etc. on a test set
  4. predict on test set using h2o.predict and export results

The exact line I've used for h2o.performance is:

h2o.performance(my_rf, newdata = as.h2o(test))

.... which gives me a confusion matrix, from which I can calculate accuracy (as well as giving me AUC, max F1 score etc)

I would have thought that using

h2o.predict(my_rf, newdata = as.h2o(test)) 

I would be able to replicate the confusion matrix from h2o.performance. But the accuracy is different - 3% worse in fact.

Is anyone able to explain why this is so?

Also, is there any way to return the predictions that make up the confusion matrix in h2o.performance?

Edit: here is the relevant code:

library(mlbench)
data(Sonar)
head(Sonar)

mainset <- Sonar
mainset$Class <- ifelse(mainset$Class == "M", 0,1)          #binarize
mainset$Class <- as.factor(mainset$Class)

response <- "Class"
predictors <- setdiff(names(mainset), c(response, "name"))

# split into training and test set

library(caTools)
set.seed(123)
split = sample.split(mainset[,61], SplitRatio = 0.75)
train = subset(mainset, split == TRUE)
test =  subset(mainset, split == FALSE)

# connect to h2o

Sys.unsetenv("http_proxy")
Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7')                #set JAVA home for 32 bit
library(h2o)
h2o.init(nthreads = -1)

# stacked ensembles

nfolds <- 5
ntrees_opts <- c(20:500)             
max_depth_opts <- c(4,8,12,16,20)
sample_rate_opts <- seq(0.3,1,0.05)
col_sample_rate_opts <- seq(0.3,1,0.05)

rf_hypers <- list(ntrees = ntrees_opts, max_depth = max_depth_opts,
                  sample_rate = sample_rate_opts,
                  col_sample_rate_per_tree = col_sample_rate_opts)

search_criteria <- list(strategy = 'RandomDiscrete', max_runtime_secs = 240, max_models = 15,
                        stopping_metric = "AUTO", stopping_tolerance = 0.00001, stopping_rounds = 5, seed = 1)

my_rf <- h2o.grid("randomForest", grid_id = "rf_grid", x = predictors, y = response,
                  training_frame = as.h2o(train),
                  nfolds = 5,
                  fold_assignment = "Modulo",
                  keep_cross_validation_predictions = TRUE,
                  hyper_params = rf_hypers,
                  search_criteria = search_criteria)

get_grid_rf <- h2o.getGrid(grid_id = "rf_grid", sort_by = "auc", decreasing = TRUE)                         # get grid of models built
my_rf <- h2o.getModel(get_grid_rf@model_ids[[1]])
perf_rf <- h2o.performance(my_rf, newdata = as.h2o(test))

pred <- h2o.predict(my_rf, newdata = as.h2o(test))
pred <- as.vector(pred$predict)

cm <- table(test[,61], pred)
print(cm)

Upvotes: 3

Views: 3579

Answers (3)

dcleere

Reputation: 11

The difference between performance() and predict() is explained below; the text is taken directly from H2O's documentation - http://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#prediction. A quick way to compare the two thresholds is sketched after the list.

Prediction Threshold

For classification problems, when running h2o.predict() or .predict(), the prediction threshold is selected as follows:

  • If you train a model with only training data, the Max F1 threshold from the train data model metrics is used.
  • If you train a model with train and validation data, the Max F1 threshold from the validation data model metrics is used.
  • If you train a model with train data and set the nfold parameter, the Max F1 threshold from the training data model metrics is used.
  • If you train a model with the train data and validation data and also set the nfold parameter, the Max F1 threshold from the validation data model metrics is used.
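As a rough illustration of that difference (a sketch only, assuming the my_rf and test objects from the question), you can pull both thresholds out and compare them; h2o.find_threshold_by_max_metric() extracts the max F1 cut-off from a metrics object:

# Threshold h2o.predict() uses: max F1 from the model's own training metrics,
# since the grid was trained with nfolds set and no validation frame
train_perf     <- h2o.performance(my_rf, train = TRUE)
predict_thresh <- h2o.find_threshold_by_max_metric(train_perf, "f1")

# Threshold behind the confusion matrix printed by h2o.performance on new data:
# max F1 recomputed on the test set itself
test_perf   <- h2o.performance(my_rf, newdata = as.h2o(test))
test_thresh <- h2o.find_threshold_by_max_metric(test_perf, "f1")

c(predict_threshold = predict_thresh, test_threshold = test_thresh)

If the two values differ, the labels in h2o.predict's predict column will not reproduce the confusion matrix shown by h2o.performance on the test set.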

Upvotes: 0

Art W

Reputation: 1

When predicting on new data that has no actual outcome to compare against (no 'y' column in h2o terms), there is no Max F1 threshold or any other metric, so you have to rely on the predictions returned by h2o.predict().

Upvotes: 0

AvkashChauhan

Reputation: 20571

Most likely, h2o.performance is using the max F1 threshold to decide yes vs. no. If you take the h2o.predict results and rebuild the table yourself, splitting yes/no at the model's max F1 threshold, you will see the numbers almost match. I believe this is the main reason you see a discrepancy between h2o.performance and h2o.predict.
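A minimal sketch of that check, reusing the my_rf and test objects from the question (the p1 column returned by h2o.predict holds the class-1 probability):

# Confusion matrix reported by h2o.performance (cut at the test-set max F1 threshold)
perf_rf <- h2o.performance(my_rf, newdata = as.h2o(test))
h2o.confusionMatrix(perf_rf)

# Rebuild the table from h2o.predict, but cut p1 at that same threshold
# instead of using the labels in the predict column
thresh <- h2o.find_threshold_by_max_metric(perf_rf, "f1")
pred   <- as.data.frame(h2o.predict(my_rf, newdata = as.h2o(test)))
manual <- ifelse(pred$p1 >= thresh, 1, 0)

table(actual = test$Class, predicted = manual)   # should now line up with h2o.confusionMatrix(perf_rf)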

Upvotes: 3
