Code Pope

Reputation: 5449

Why does sklearn's Isolation Forest predict wrongly?

I was looking at the official scikit-learn sample for Isolation Forest: IsolationForest example
I only made a small change so that the anomalies the fitted Isolation Forest predicts on the training data, i.e. the points where y_pred_train == -1, are also plotted.
Here is the full code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# fit the model
# (behaviour='new' only exists in older scikit-learn releases; it is the
#  default and only behaviour in recent versions, where the argument was removed)
clf = IsolationForest(behaviour='new', max_samples=100,
                      random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# plot the line, the samples, and the nearest vectors to the plane
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white',
                 s=20, alpha=0.5, edgecolor='k')
b11 = plt.scatter(X_train[y_pred_train == -1, 0], X_train[y_pred_train == -1, 1],
                  c='grey', s=20, alpha=0.5, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green',
                 s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red',
                s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([b1, b11, b2, c],
           ["training observations","predicted_abnormal",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()

[Resulting plot: Isolation Forest decision contour with training observations (white), predicted abnormal training observations (grey), new regular observations (green) and new abnormal observations (red)]

But the fitted Isolation Forest now predicts many of the data points that the example defines as regular to be anomalies (the grey points). Why is this the case? Why does the original example treat them as regular?

Upvotes: 1

Views: 1362

Answers (1)

DataBach

Reputation: 1633

Note that in this example the outliers were generated independently from the training and test data and were not used to fit the model. Without the outliers being part of the fit, some of the abnormal values end up "closer" in proximity to the normal values; those are the points shown in grey in your plot. However, if you fit the outliers to the model as well, the set of values considered abnormal changes. That is why they are not highlighted in the example you posted; these values were only "temporarily abnormal", so to speak.
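As a rough sketch of this point, the snippet below refits the same kind of model twice, once on X_train alone and once on X_train together with X_outliers, and compares which training points end up flagged. It reuses the data generation from the question; the exact counts will depend on the scikit-learn version.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# rebuild the data exactly as in the question
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# behaviour='new' is omitted here; it is the only behaviour in recent scikit-learn
clf_plain = IsolationForest(max_samples=100, random_state=42,
                            contamination='auto').fit(X_train)
clf_mixed = IsolationForest(max_samples=100, random_state=42,
                            contamination='auto').fit(np.r_[X_train, X_outliers])

flagged_plain = clf_plain.predict(X_train) == -1
flagged_mixed = clf_mixed.predict(X_train) == -1
print("training points flagged when fit without outliers:", flagged_plain.sum())
print("training points flagged when fit with outliers:   ", flagged_mixed.sum())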

Consider how the anomaly score is calculated, as explained in this article: https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
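To make the thresholding concrete, here is a small sketch using the clf and X_train from the question's code: predict() simply compares each sample's score against a fixed offset, so the grey points are just the training samples whose score falls below that cut-off.

# higher score_samples means "more normal"
scores = clf.score_samples(X_train)

# decision_function is the score shifted by the learned offset;
# with contamination='auto' the offset is fixed at -0.5
print("offset_:", clf.offset_)
decision = clf.decision_function(X_train)

# predict() labels a point -1 exactly when its decision_function is negative
reconstructed = np.where(decision < 0, -1, 1)
print("matches clf.predict(X_train):",
      np.array_equal(reconstructed, clf.predict(X_train)))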

Upvotes: 1
