Dina Stack
Dina Stack

Reputation: 41

Is Isolation Forest a Distance-Based Model?

I know five standard unsupervised outlier detection approaches :

But what type is Isolation Forest? According to google it is a ''tree-based model'', but is it also possible to say that isolation forest is a Distance-based model?

Upvotes: 0

Views: 531

Answers (1)

David Thery
David Thery

Reputation: 719

While the Anomaly detection's wikipedia page states that it is a Density-based technique, you should instead refer to the original paper, and Scikit-learn documentation.

Isolation forest is indeed useful for anomaly detection, and is particularly efficient for large datasets. It is represented by a tree structure, and given that it uses recursive partitioning,
the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

In addition to all the details provided in the Scikit-learn docs, you can read in the source paper:

Apart from the key difference of isolation versus profiling, iForest is distinguished from existing model-based, distance-based and density-based methods in the following ways:

  1. The isolation characteristic of iTrees enables them to build partial models and exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection; it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.
  1. iForest utilizes no distance or density measures to detect anomalies. This eliminates major computational cost of distance calculation in all distance-based methods and density-based methods.
  1. iForest has a linear time complexity with a low constant and a low memory requirement. To our best knowledge, the best-performing existing method achieves only approximate linear time complexity with high memory usage.
  1. iForest has the capacity to scale up to handle extremely large data size and high-dimensional problems with a large number of irrelevant attributes.

Upvotes: 2

Related Questions