Chènevis

Reputation: 513

Trying to understand isolation forest algorithm

I am trying to use isolation forest algorithm with Python scikit-learn.

I do not understand why I have to generate the sets X_test and X_outliers, because when I get my data I have no idea whether it contains outliers or not. But maybe this is just an example, and I do not have to generate and fill those sets in every case. I thought that isolation forest does not need a clean X_train (one with no outliers).

Did I misunderstand the algorithm? Do I have to use another algorithm (I thought about one-class SVM, but its X_train has to be as clean as possible)?

Is the isolation forest algorithm unsupervised or supervised (like the random forest algorithm)?

Upvotes: 9

Views: 9303

Answers (3)

Kerem Ürkmez

Reputation: 11

Isolation forest and DBSCAN are among the prominent methods for nonparametric structures. An advantage of the isolation forest method is that it needs no scaling beforehand, but it cannot work with missing values, so you have to deal with those first.

preds = iso.fit_predict(train_nan_dropped_for_isoF)

And remember: .fit_predict() is not used for test data; use .predict() there.
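A minimal sketch of that workflow on made-up data (the variable names are illustrative, not from the question): drop rows with NaNs, call .fit_predict() on the training data, and only .predict() on new data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Toy training data with one missing value; IsolationForest cannot handle NaNs.
train = np.vstack([rng.normal(0, 1, size=(100, 2)), [[np.nan, 0.0]]])
test = rng.normal(0, 1, size=(20, 2))

# Drop rows containing NaN before fitting.
train_clean = train[~np.isnan(train).any(axis=1)]

iso = IsolationForest(random_state=42)
# fit_predict: fit the forest and label the training data in one call...
train_preds = iso.fit_predict(train_clean)
# ...but on test data, only predict (the forest is already fitted).
test_preds = iso.predict(test)
# Labels are +1 for inliers and -1 for outliers.
```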


Upvotes: 0

Amar nayak

Reputation: 157

"Is the isolation forest algorithm unsupervised or supervised (like the random forest algorithm)?"

Isolation forest is an unsupervised algorithm, so it does not need labels to identify outliers/anomalies. It works as follows:

  1. The data is partitioned randomly and recursively, and each partitioning is represented as a tree (a forest of random trees). This is the training stage, where the user sets the subsample size and the number of trees. The authors (Liu and Ting, 2008) suggest a default subsample size of 256 and 100 trees. The score converges as the number of trees increases, but fine tuning may be required case by case.

[figure: recursive partitioning isolates an outlier in far fewer splits than a normal point]

  2. A leaf is reached once the recursive partitioning of the data is finished. The path length needed to reach an outlier is expected to be far shorter than that for normal data (see the figure).

  3. The path lengths are averaged and normalised to compute an anomaly score. A score close to 1 marks an outlier, while scores well below 0.5 indicate normal observations.

The outlier judgment is based purely on this score, so no label column is needed. It is therefore an unsupervised algorithm.
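The steps above can be sketched with scikit-learn, whose defaults mirror the paper (100 trees, subsample capped at 256). Note one wrinkle: scikit-learn's score_samples returns the *negated* paper score, so it is negated back here to recover the 0-to-1 convention described above. The toy data is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0, 0.5, size=(300, 2)),  # dense "normal" cluster around the origin
    [[6.0, 6.0]],                       # one obvious outlier, far from the cluster
])

# Paper defaults: 100 trees, subsample size 256.
iso = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)

# score_samples is the opposite of the paper's anomaly score;
# negating it recovers the 0..1 scale (close to 1 = outlier).
scores = -iso.score_samples(X)
# The isolated point ends up with a clearly higher score than the cluster.
```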

Upvotes: 12

MMF

Reputation: 5929

Question :

I do not understand why I have to generate the sets X_test and X_outliers, because when I get my data I have no idea whether it contains outliers or not.

Answer :

You don't have to generate X_outliers. It is just an example showing that Isolation Forest can detect outliers; that dataset is random and has nothing to do with the original data.

All you need to do is fit your IsolationForest to your training data. Then, if you want, check a test set for outliers as a preprocessing step.
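A sketch of that two-step use, on synthetic data: the forest is fitted on raw training data that may itself contain a few anomalies (no labels, no cleaning guarantee), and then predict filters a test set as preprocessing. The contamination value here is an assumption supplied by the user, not something learned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
# Unlabeled training data that may contain a few anomalies of its own.
X_train = np.vstack([rng.normal(0, 1, size=(500, 3)),
                     rng.uniform(-8, 8, size=(5, 3))])
X_test = rng.normal(0, 1, size=(50, 3))

# contamination = expected outlier fraction in the training data
# (a user-supplied assumption, illustrative here).
iso = IsolationForest(contamination=0.01, random_state=1).fit(X_train)

# Preprocessing step: keep only the test rows predicted as inliers (+1).
mask = iso.predict(X_test) == 1
X_test_clean = X_test[mask]
```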

Upvotes: 8
