Reputation: 201
I am currently working on detecting outliers in my dataset using Isolation Forest in Python and I did not completely understand the example and explanation given in scikit-learn documentation
Is it possible to use Isolation Forest to detect outliers in my dataset that has 258 rows and 10 columns?
Do I need a separate dataset to train the model? If yes, is it necessary to have that training dataset free from outliers?
This is my code:
rng = np.random.RandomState(42)
X = 0.3*rng.randn(100,2)
X_train = np.r_[X+2,X-2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto'
clf.fit(X_train)
y_pred_train = clf.predict(x_train)
y_pred_test = clf.predict(x_test)
print(len(y_pred_train))
I tried by loading my dataset to X_train
but that does not seem to work.
Upvotes: 6
Views: 6127
Reputation: 25199
Do I need a separate dataset to train the model?
Short answer is "No". You train and predict outliers on the same data.
IsolationForest
is an unsupervised learning algorithm that's intended to clean your data from outliers (see docs for more). In usual machine learning settings, you would run it to clean your training dataset. As far as your toy example concerned:
rng = np.random.RandomState(42)
X = 0.3*rng.randn(100,2)
X_train = np.r_[X+2,X-2]
from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples=100, random_state=rng, behaviour="new", contamination=.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train
array([ 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, -1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
where 1
represent inliers and -1
represent outliers. As specified by contamination
param, the fraction of outliers is 0.1
.
Finally, you would remove outliers like:
X_train_cleaned = X_train[np.where(y_pred_train == 1, True, False)]
Upvotes: 9