Joby

Reputation: 21

Isolation Forest Implementation

I would like to use Isolation Forest to identify the outliers in my dataset.

The training set contains 4000 records with 40 feature columns, each with value 0 or 1.

I know how to use Isolation Forest with 2 features, following the sample example given in scikit-learn.

How do I use all 40 features and see the outliers?

Upvotes: 2

Views: 1443

Answers (1)

Rick

Reputation: 2110

I simplified the scikit-learn example a bit. X is your dataset; in your case it would have 40 features and 4000 rows, while in this example it has 5 features and 300 rows. Isolation Forest works the same way for any number of feature columns. You fit the model to your numerical data with clf.fit(X) so that it learns the "boundaries" of your data. In the next step you classify the same data X with the fitted model and get an array y with 300 entries, one for each row in your dataset. Each entry in y is -1 (outlier) or 1 (inlier).

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Generate train data
s = rng.randn(100, 5)
X = np.r_[s + 2, s - 2, s - 5]

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X)
y = clf.predict(X)
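
To actually see which rows are outliers when you have 40 binary columns, something like the following should work. This is only a sketch: the random 0/1 matrix is a made-up stand-in for your real data, and the names labels, scores, outlier_rows and most_anomalous are just illustrative. It uses predict for the -1/1 labels and decision_function for a score per row, where lower scores mean more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Hypothetical stand-in for the real data: 4000 rows, 40 binary features.
# Replace this with your own array or DataFrame values.
X = rng.randint(0, 2, size=(4000, 40))

# All 40 columns are used; nothing special is needed for more than 2 features.
clf = IsolationForest(random_state=0)
clf.fit(X)

labels = clf.predict(X)            # -1 = outlier, 1 = inlier
scores = clf.decision_function(X)  # negative values correspond to outliers

# Row indices flagged as outliers
outlier_rows = np.where(labels == -1)[0]
print(len(outlier_rows), "rows flagged as outliers")

# Indices of the 10 rows with the lowest (most anomalous) scores
most_anomalous = np.argsort(scores)[:10]
print(most_anomalous)
```

If you want a fixed share of rows to be flagged, you could set the contamination parameter of IsolationForest instead of relying on its default threshold; which value makes sense depends on your data.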

Upvotes: 2
