dsauce

Reputation: 612

How to interpret the reconstruction MSE from H2O anomaly detection?

I am using h2o for anomaly detection. The data contains several continuous and categorical features, and the label is either 0 or 1. Because fewer than 1% of the rows are labeled 1, I am trying anomaly detection instead of the usual classification methods. However, what I get in the end is an MSE calculated per row of the data, and I am not sure how to interpret it so that I can say a row whose actual label is 0 is an anomaly and should really be 1.

The code I am using so far:

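library(h2o)
h2o.init()  # start or connect to a local H2O cluster

# Use every column except the label as an input feature
features <- setdiff(names(train.df), "label")
# Train only on the majority class (label == 0)
train.df <- subset(train.df, label==0)
train.h <- as.h2o(train.df)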

mod.dl <- h2o.deeplearning(
  x=features,
  autoencoder=TRUE,
  training_frame=train.h,
  activation=c("Tanh"),
  hidden=c(10,10), epochs=20, adaptive_rate=FALSE,
  variable_importances=TRUE, 
  l1=1e-4, l2=1e-4,
  sparse=TRUE
)

# Per-row reconstruction error of the training frame under the fitted autoencoder
pred.oc <- as.data.frame(h2o.anomaly(mod.dl, train.h))

head(pred.oc):

  Reconstruction.MSE
1        0.012059304
2        0.014490905
3        0.011002231
4        0.013142910
5        0.009631915
6        0.012897779

Upvotes: 1

Views: 1051

Answers (1)

user3896928

Reputation: 11

An autoencoder learns a nonlinear, reduced-dimension representation of the original data. It is an unsupervised approach, so it only considers the features and never sees the label. It is not a classification method.

The reconstruction mean squared error measures how poorly the autoencoder reproduces a row from its compressed representation. Rows/observations with a high MSE are the ones considered anomalies.
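To make the per-row quantity concrete, here is a minimal base-R sketch of a reconstruction MSE. The matrices `x` and `x_hat` are made-up values, not output from the model in the question, and H2O computes its own figure on its internal representation of the features (as far as I know, after scaling and expanding categoricals), so this only illustrates the idea.

```r
# Hypothetical inputs (assumed already scaled) and their reconstructions.
x <- matrix(c(0.1, 0.5, 0.9,
              0.2, 0.4, 0.8), nrow = 2, byrow = TRUE)
x_hat <- matrix(c(0.1, 0.5, 0.9,
                  0.8, 0.1, 0.2), nrow = 2, byrow = TRUE)

# Per-row MSE: average of the squared differences across the features.
row_mse <- rowMeans((x - x_hat)^2)
print(row_mse)
```

The first row is reproduced exactly and gets an error of zero; the second row is reproduced badly and gets a much larger error, which is what makes it stand out.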

In your case, the rows with the highest MSE should be considered anomalous. Since the model was trained only on rows labeled 0, these could be rows that are really 1s but are labeled as 0. However, that conclusion can't be drawn definitively from an autoencoder approach alone.
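One way to turn "highest MSE" into a rule is to pick a cutoff on the error distribution and flag everything above it. The sketch below uses base R only. The helper `flag_anomalies` and the quantile-based cutoff are my own assumptions rather than anything H2O provides, and the six values are just the `head()` output from the question; in practice you would pass the whole `pred.oc$Reconstruction.MSE` column and choose the quantile to suit your data.

```r
# The six values shown in the question, standing in for the full column.
mse <- c(0.012059304, 0.014490905, 0.011002231,
         0.013142910, 0.009631915, 0.012897779)

# Hypothetical helper: flag rows whose error exceeds a chosen quantile.
flag_anomalies <- function(mse, prob = 0.99) {
  cutoff <- quantile(mse, probs = prob, names = FALSE)
  data.frame(row = seq_along(mse), mse = mse, anomaly = mse > cutoff)
}

# prob = 0.8 is arbitrary here, chosen only because the sample is tiny.
flagged <- flag_anomalies(mse, prob = 0.8)
print(flagged[order(-flagged$mse), ])
```

A quantile cutoff always flags some rows, so the flagged rows are candidates to inspect, not confirmed 1s. If you have held-out rows labeled 1, scoring them with the same model and comparing their errors against the 0s is a more direct check of whether the errors separate the two classes.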

Upvotes: 1
