The proper way to use IsolationForest to detect outliers in a high-dimensional dataset

Question

I use the following simple IsolationForest code to detect outliers in a dataset X of 20K samples and 16 features:

from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=.8)

clf = IsolationForest()
clf.fit(X)   # Notice I am using the entire dataset X when fitting!!
print(clf.predict(X))

I get the result:

[ 1 1 1 -1 ... 1 1 1 -1 1]

My question is: is it logically correct to fit IsolationForest on the entire dataset X, or only on train_X?

Bert Kellerman · Accepted Answer

Yes, it is logically correct to ultimately train on the entire dataset.

With that in mind, you could first fit on the training set only and compare the anomaly scores of the test set against those of the training set. This could tell you whether the test set comes from a distribution similar to that of your training set.

If the test set scores as anomalous compared to the training set, then you can expect future data to look anomalous as well. In that case, I would want more data to get a more complete view of what is 'normal'.

If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
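A minimal sketch of that comparison follows. The synthetic X, the random_state values, and the use of score_samples and the flagged-outlier rate as the comparison statistics are assumptions made for illustration, not part of the original question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's X (20K samples, 16 features); an assumption.
rng = np.random.RandomState(0)
X = rng.normal(size=(20000, 16))

train_X, test_X = train_test_split(X, train_size=0.8, random_state=0)

# Fit on the training split only, then score both splits.
clf = IsolationForest(random_state=0).fit(train_X)

# score_samples returns values <= 0; lower means more anomalous.
train_scores = clf.score_samples(train_X)
test_scores = clf.score_samples(test_X)

# Fraction of points flagged as outliers (-1) in each split.
train_outlier_rate = np.mean(clf.predict(train_X) == -1)
test_outlier_rate = np.mean(clf.predict(test_X) == -1)

print("mean score  train: %.4f  test: %.4f" % (train_scores.mean(), test_scores.mean()))
print("outlier rate train: %.4f  test: %.4f" % (train_outlier_rate, test_outlier_rate))
```

If the two means and the two outlier rates are close, the held-out data looks like the training data to the model, and refitting on all of X is a reasonable final step.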

Perhaps you could use sklearn's TimeSeriesSplit CV in this fashion to get a sense of how much data is enough for your problem?
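One way this could look, as a rough sketch: the synthetic X, the assumption that rows are ordered in time, n_splits=5, and the mean of score_samples as the summary statistic are all illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data, assumed to be ordered by time (an assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 16))

results = []
# Each successive split trains on a longer prefix and tests on the next chunk.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    clf = IsolationForest(random_state=0).fit(X[train_idx])
    train_mean = clf.score_samples(X[train_idx]).mean()
    test_mean = clf.score_samples(X[test_idx]).mean()
    results.append((len(train_idx), train_mean, test_mean))
    print("n_train=%d  train mean=%.4f  test mean=%.4f"
          % (len(train_idx), train_mean, test_mean))
```

Once the gap between the train and test means stops shrinking as n_train grows, adding more data is probably no longer changing the model's notion of 'normal' much.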

Since the data is unlabeled as far as the anomaly detector is concerned, the more data the better when defining 'normal'.
