ClaytonJY

Reputation: 1244

Unexpected predictions in h2o.deeplearning

I'm experimenting with binary classifiers built with h2o.deeplearning in the h2o package. When I build a model and then use h2o.predict on some new (held-out) dataset, I notice that for some rows the predict column does not match the class with the highest probability.

Here's a reproducible example, adapted from h2o's deeplearning tutorial:

library(h2o)

h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll()


df <- h2o.importFile(path = "https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/tutorials/data/covtype.full.csv")

splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
train  <- h2o.assign(splits[[1]], "train.hex") # 60%
valid  <- h2o.assign(splits[[2]], "valid.hex") # 20%
test   <- h2o.assign(splits[[3]], "test.hex")  # 20%

response <- "Cover_Type"
predictors <- setdiff(names(df), response)

train$bin_response <- ifelse(train[,response]=="class_1", 0, 1)
train$bin_response <- as.factor(train$bin_response) ##make categorical

# apply same transforms to test
test$bin_response <- ifelse(test[,response]=="class_1", 0, 1)
test$bin_response <- as.factor(test$bin_response)

dlmodel <- h2o.deeplearning(
  x=predictors,
  y="bin_response", 
  training_frame=train,
  hidden=c(10,10),
  epochs=0.1
  #balance_classes=T    ## enable this for high class imbalance
)

pred <- h2o.predict(dlmodel, test)

Now let's bring the predictions into R and add some new columns, using dplyr for simplicity:

library(dplyr)

pred_df <- bind_cols(
  select(as.data.frame(test), actual = bin_response),
  as.data.frame(pred)
) %>%
  tbl_df %>%
  mutate(
    derived_predict = factor(as.integer(p1 > p0)),
    match = predict == derived_predict
  )

Now I would think the prediction should always match the column with the highest probability, but that's not always the case:

> pred_df %>% summarize(sum(match) / n())
# A tibble: 1 x 1
  sum(match)/n()
           <dbl>
1      0.9691755

Why isn't that value exactly 1? In my most recent run of the above code, the rows where they disagree all have p0 and p1 values that are fairly close:

> pred_df %>% filter(!match)
# A tibble: 3,575 x 6
   actual predict        p0        p1 derived_predict match
   <fctr>  <fctr>     <dbl>     <dbl>          <fctr> <lgl>
1       1       1 0.5226679 0.4773321               0 FALSE
2       0       1 0.5165302 0.4834698               0 FALSE
3       0       1 0.5225683 0.4774317               0 FALSE
4       0       1 0.5120126 0.4879874               0 FALSE
5       1       1 0.5342851 0.4657149               0 FALSE
6       0       1 0.5335089 0.4664911               0 FALSE
7       0       1 0.5182881 0.4817119               0 FALSE
8       0       1 0.5094492 0.4905508               0 FALSE
9       0       1 0.5309947 0.4690053               0 FALSE
10      0       1 0.5234880 0.4765120               0 FALSE
# ... with 3,565 more rows

but that still doesn't explain why h2o.predict chooses the less probable value.

Am I doing something wrong here? Is this a bug in h2o? Does h2o intentionally use more information in picking a prediction than it presents to me here?

Interestingly, using my derived_predict yields slightly higher accuracy:

> pred_df %>%
+   summarize(
+     original = sum(actual == predict)         / n(),
+     derived  = sum(actual == derived_predict) / n()
+   )
# A tibble: 1 x 2
   original   derived
      <dbl>     <dbl>
1 0.7794946 0.7827452

Upvotes: 3

Views: 1033

Answers (1)

phiver

Reputation: 23608

I ran into the same issue when trying to explain the predicted value versus the p1 value.

By default, H2O labels binary predictions using the threshold that maximizes the F1 score on the model's metrics, not a fixed 0.5 cutoff, which is why predict can disagree with whichever of p0 and p1 is larger. With the p1 column you can apply your own threshold instead.

It is not very obvious from the documentation, but you can find it in the R booklet. Strangely enough, it is not in the DRF, GBM, or Deep Learning booklets.
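
If you want the label to simply be the more probable class, you can re-derive it from p1 with a 0.5 cutoff, or first look up the F1-maximizing threshold the model uses. A minimal sketch, assuming h2o.performance and h2o.find_threshold_by_max_metric behave as in recent h2o releases:

perf <- h2o.performance(dlmodel, train = TRUE)
thr  <- h2o.find_threshold_by_max_metric(perf, "f1")  # max-F1 threshold from the training metrics

pred_r <- as.data.frame(pred)
f1_predict     <- as.integer(pred_r$p1 > thr)   # should reproduce pred$predict
argmax_predict <- as.integer(pred_r$p1 > 0.5)   # simply the more probable class

Since that threshold is generally not 0.5, the rows where predict disagrees with your derived_predict are exactly the ones whose p1 falls between 0.5 and the max-F1 threshold.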

Upvotes: 2
