Reputation: 1244
I'm experimenting with deep learning binary classifiers using the h2o package. When I build a model and then use h2o.predict on some new (held-out) dataset, I notice that for some rows, the predict output does not match the class with the highest probability.
Here's a reproducible example, adapted from h2o's deeplearning tutorial:
library(h2o)
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll()
df <- h2o.importFile(path = "https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/tutorials/data/covtype.full.csv")
splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
train <- h2o.assign(splits[[1]], "train.hex") # 60%
valid <- h2o.assign(splits[[2]], "valid.hex") # 20%
test <- h2o.assign(splits[[3]], "test.hex") # 20%
response <- "Cover_Type"
predictors <- setdiff(names(df), response)
train$bin_response <- ifelse(train[,response]=="class_1", 0, 1)
train$bin_response <- as.factor(train$bin_response) ##make categorical
# apply same transforms to test
test$bin_response <- ifelse(test[,response]=="class_1", 0, 1)
test$bin_response <- as.factor(test$bin_response)
dlmodel <- h2o.deeplearning(
  x=predictors,
  y="bin_response",
  training_frame=train,
  hidden=c(10,10),
  epochs=0.1
  #balance_classes=T ## enable this for high class imbalance
)
pred <- h2o.predict(dlmodel, test)
Now let's manipulate that to bring it into R and add some new columns, using dplyr for simplicity:
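library(dplyr)
# combine the actual labels with h2o's predictions in a local tibble
pred_df <- bind_cols(
  select(as.data.frame(test), actual = bin_response),
  as.data.frame(pred)
) %>%
  as_tibble() %>%
  mutate(
    # label implied by simply picking the more probable class
    derived_predict = factor(as.integer(p1 > p0)),
    match = predict == derived_predict
  )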
Now I would think the prediction should always match the column with the highest probability, but that's not always the case:
> pred_df %>% summarize(sum(match) / n())
# A tibble: 1 x 1
  `sum(match)/n()`
             <dbl>
1        0.9691755
Why isn't that value exactly 1? In my most recent run of the above code, the p0 and p1 values in the mismatched rows are fairly close
> pred_df %>% filter(!match)
# A tibble: 3,575 x 6
   actual predict        p0        p1 derived_predict match
   <fctr>  <fctr>     <dbl>     <dbl>          <fctr> <lgl>
 1      1       1 0.5226679 0.4773321               0 FALSE
 2      0       1 0.5165302 0.4834698               0 FALSE
 3      0       1 0.5225683 0.4774317               0 FALSE
 4      0       1 0.5120126 0.4879874               0 FALSE
 5      1       1 0.5342851 0.4657149               0 FALSE
 6      0       1 0.5335089 0.4664911               0 FALSE
 7      0       1 0.5182881 0.4817119               0 FALSE
 8      0       1 0.5094492 0.4905508               0 FALSE
 9      0       1 0.5309947 0.4690053               0 FALSE
10      0       1 0.5234880 0.4765120               0 FALSE
# ... with 3,565 more rows
but that still doesn't explain why h2o.predict chooses the less probable value.
Am I doing something wrong here? Is this a bug in h2o? Does h2o intentionally use more information in picking a prediction than it presents to me here?
Interestingly, using my derived_predict yields slightly higher accuracy, by a hair:
> pred_df %>%
+ summarize(
+ original = sum(actual == predict) / n(),
+ derived = sum(actual == derived_predict) / n()
+ )
# A tibble: 1 x 2
   original   derived
      <dbl>     <dbl>
1 0.7794946 0.7827452
Upvotes: 3
Views: 1033
Reputation: 23608
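I ran into the same issue when trying to reconcile the predicted value with the p1 value.
For binary classification, H2O by default does not label a row by comparing p1 to 0.5; it uses the threshold that maximizes the F1 score. That threshold can sit below 0.5, which is why rows with p1 just under 0.5 are still predicted as 1. If you want a different cutoff, apply your own threshold to the p1 column.
This is not very obvious from reading the documentation. You can find it in the R booklet, but, strangely enough, not in the DRF, GBM, or Deep Learning booklets.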
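To illustrate the max-F1 thresholding mentioned above, here is a minimal base R sketch. The probabilities and labels are made up for illustration (they are not from the covtype run), and the helper `f1_at` is a hypothetical name, not part of h2o. It only shows the general idea of choosing the cutoff that maximizes F1 and how that can disagree with a plain 0.5 cutoff; it is not H2O's actual implementation.

```r
# Toy data: predicted probability of class 1 and the true labels (made up).
p1     <- c(0.95, 0.80, 0.62, 0.48, 0.47, 0.45, 0.30, 0.20, 0.10, 0.05)
actual <- c(1,    1,    1,    1,    1,    0,    1,    0,    0,    0)

# Hypothetical helper: F1 score when every row with p1 >= threshold is labelled 1.
f1_at <- function(threshold, p1, actual) {
  predicted <- as.integer(p1 >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  if (tp == 0) return(0)
  2 * tp / (2 * tp + fp + fn)
}

# Try every distinct predicted probability as a candidate threshold.
candidates     <- sort(unique(p1))
f1_scores      <- vapply(candidates, f1_at, numeric(1), p1 = p1, actual = actual)
best_threshold <- candidates[which.max(f1_scores)]

# Labels under the max-F1 threshold versus a plain 0.5 cutoff.
label_maxf1 <- as.integer(p1 >= best_threshold)
label_half  <- as.integer(p1 > 0.5)
mismatch    <- label_maxf1 != label_half

print(best_threshold)
print(data.frame(p1, actual, label_maxf1, label_half, mismatch))
```

On this toy data the max-F1 threshold falls below 0.5, so several rows with p1 under 0.5 are still labelled 1, which is the same pattern as in the question. With a real model, I believe the threshold H2O used can be read with something like `h2o.find_threshold_by_max_metric(h2o.performance(dlmodel), "f1")`, but treat that call as an assumption and check it against your h2o version.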
Upvotes: 2