Reputation: 53
Apologies if this has been answered elsewhere, but I couldn't find anything.
I'm using h2o (latest release) in R. I've created a random forest model using h2o.grid (for parameter tuning) and called it 'my_rf'.
My steps are as follows:
The exact line I've used for h2o.performance is:
h2o.performance(my_rf, newdata = as.h2o(test))
...which gives me a confusion matrix, from which I can calculate accuracy (as well as AUC, max F1 score, etc.)
I would have thought that using
h2o.predict(my_rf, newdata = as.h2o(test))
I would be able to replicate the confusion matrix from h2o.performance. But the accuracy is different - 3% worse in fact.
Is anyone able to explain why this is so?
Also, is there any way to return the predictions that make up the confusion matrix in h2o.performance?
Edit: here is the relevant code:
library(mlbench)
data(Sonar)
head(Sonar)
mainset <- Sonar
mainset$Class <- ifelse(mainset$Class == "M", 0,1) #binarize
mainset$Class <- as.factor(mainset$Class)
response <- "Class"
predictors <- setdiff(names(mainset), c(response, "name"))
# split into training and test set
library(caTools)
set.seed(123)
split = sample.split(mainset[,61], SplitRatio = 0.75)
train = subset(mainset, split == TRUE)
test = subset(mainset, split == FALSE)
# connect to h2o
Sys.unsetenv("http_proxy")
Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7') #set JAVA home for 32 bit
library(h2o)
h2o.init(nthread = -1)
# stacked ensembles
nfolds <- 5
ntrees_opts <- c(20:500)
max_depth_opts <- c(4,8,12,16,20)
sample_rate_opts <- seq(0.3,1,0.05)
col_sample_rate_opts <- seq(0.3,1,0.05)
rf_hypers <- list(ntrees = ntrees_opts, max_depth = max_depth_opts,
sample_rate = sample_rate_opts,
col_sample_rate_per_tree = col_sample_rate_opts)
search_criteria <- list(strategy = 'RandomDiscrete', max_runtime_secs = 240, max_models = 15,
stopping_metric = "AUTO", stopping_tolerance = 0.00001, stopping_rounds = 5,seed = 1)
my_rf <- h2o.grid("randomForest", grid_id = "rf_grid", x = predictors, y = response,
training_frame = as.h2o(train),
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
hyper_params = rf_hypers,
search_criteria = search_criteria)
get_grid_rf <- h2o.getGrid(grid_id = "rf_grid", sort_by = "auc", decreasing = TRUE) # get grid of models built
my_rf <- h2o.getModel(get_grid_rf@model_ids[[1]])
perf_rf <- h2o.performance(my_rf, newdata = as.h2o(test))
pred <- h2o.predict(my_rf, newdata = as.h2o(test))
pred <- as.vector(pred$predict)
cm <- table(test[,61], pred)
print(cm)
Upvotes: 3
Views: 3579
Reputation: 11
The difference between performance() and predict() is explained below. It comes directly from H2O's documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#prediction
Prediction Threshold
For classification problems, when running h2o.predict() or .predict(), the prediction threshold is selected as follows: if the model was trained with training data only, the max F1 threshold from the training metrics is used; if a validation frame was supplied, the max F1 threshold from the validation metrics is used instead. By contrast, h2o.performance(my_rf, newdata = as.h2o(test)) reports its confusion matrix at the max F1 threshold computed on the test data itself, so the labels (and hence the accuracy) can differ from h2o.predict().
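As a rough sketch (assuming the my_rf and test objects from the question, and that h2o.find_threshold_by_max_metric behaves as in recent h2o releases), you can compare the threshold predict() carries over from training with the one computed on the test set:
# Threshold baked into the model (max F1 on the training metrics);
# h2o.predict() uses this to turn p1 into a 0/1 label.
train_perf <- h2o.performance(my_rf, train = TRUE)
train_thresh <- h2o.find_threshold_by_max_metric(train_perf, "f1")
# Threshold h2o.performance() picks on the new data (max F1 on the test set);
# its default confusion matrix is reported at this value.
test_perf <- h2o.performance(my_rf, newdata = as.h2o(test))
test_thresh <- h2o.find_threshold_by_max_metric(test_perf, "f1")
print(c(train = train_thresh, test = test_thresh))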
Upvotes: 0
Reputation: 1
When predicting on new data that has no actual outcome to compare against (a 'y' column, in H2O terms), there are no metrics such as the max F1 score, so you have to rely on the predictions returned by h2o.predict().
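For example (a minimal sketch reusing the objects from the question, where 'Class' is the response column), dropping the response column leaves h2o.performance() with nothing to score against, but h2o.predict() still works:
# Test frame without the actual outcome: only predictions are possible here.
unlabeled <- as.h2o(test[, setdiff(names(test), "Class")])
pred_only <- h2o.predict(my_rf, newdata = unlabeled)
head(pred_only)  # columns predict, p0, p1; no F1/AUC can be computed without the truth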
Upvotes: 0
Reputation: 20571
Most likely, h2o.performance is using the max F1 threshold to decide between yes and no. If you take the predict results and build the table yourself, separating yes/no based on the model's max F1 threshold value, you will see the numbers almost match. I believe this is the main reason you see a discrepancy between the results of h2o.performance and h2o.predict.
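In other words (a sketch assuming the my_rf, test and perf_rf objects from the question's code, plus h2o's h2o.find_threshold_by_max_metric and h2o.confusionMatrix helpers), you can rebuild the table from the predicted probabilities at the max F1 threshold and compare it to what h2o.performance reports:
# Pull the predictions into a regular data.frame (p1 = probability of class "1").
pred_df <- as.data.frame(h2o.predict(my_rf, newdata = as.h2o(test)))
# Max F1 threshold computed on the test data, i.e. the threshold behind
# the default confusion matrix shown by h2o.performance.
thresh <- h2o.find_threshold_by_max_metric(perf_rf, "f1")
# Re-label the predictions at that threshold and tabulate against the truth.
manual_label <- ifelse(pred_df$p1 >= thresh, 1, 0)
print(table(actual = test$Class, predicted = manual_label))
# This should line up (up to ties at the threshold) with:
h2o.confusionMatrix(perf_rf)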
Upvotes: 3