Reputation: 85
I'm trying to use the IsolationForest algorithm in scikit-learn, and I'm interested in the computed score. But when calling score_samples() I don't get the scores I expect.
And here is a plot of the corresponding scores from the IsolationForest algorithm when calling score_samples():
As you can see, the two series have identical scores for almost all of the last 100 values on the right. Why? I would expect them to be different.
Furthermore, there are a couple of scores below the last 100 scores, which would indicate that those points are more likely to be anomalies. But in the series plot they are much closer to the fitted data. Why is that?
Finally, there is a difference between the two score series at the last 100 points. It is as if there is a minimum score value that they cannot drop below (even though some of the earlier scores did?).
I've looked at the score formula and at the paper referenced in scikit-learn's documentation, but that didn't get me any closer to an answer.
What is the reason for this behaviour of the score? And are there any workarounds to get a more "reasonable" score metric? Ideally I would like a score in the range (0, 1).
This is the code used to generate the two data series:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [16, 6]
### simulating data
np.random.seed(0)
X1 = np.concatenate((np.random.normal(loc=2.75, scale=0.1, size=335),
                     np.random.normal(loc=3.2, scale=0.1, size=100)))
X1_train = X1[:200]
np.random.seed(0)
X2 = np.concatenate((np.random.normal(loc=2.75, scale=0.1, size=335),
                     np.random.normal(loc=3.0, scale=0.1, size=100)))
X2_train = X2[:200]
### plotting simulated data
plt.plot(X1, 'x', label='values of series 1')
plt.plot(X2, '.', markersize=3, label='values of series 2')
plt.axvline(200, c='k', linestyle=(0, (5, 10)), linewidth=0.5) ### visualizing the end of the training data.
plt.legend(loc='upper left')
And this is the code used to generate the scores from the IsolationForest algorithm:
from sklearn.ensemble import IsolationForest
### fitting isolation forests and computing scores
iso1 = IsolationForest(random_state=0).fit(X1_train.reshape(-1, 1))
score1 = iso1.score_samples(X1.reshape(-1, 1))
iso2 = IsolationForest(random_state=0).fit(X2_train.reshape(-1, 1))
score2 = iso2.score_samples(X2.reshape(-1, 1))
### plotting scores
plt.plot(score1, 'x', label='IForest score of series 1')
plt.plot(score2, '.', markersize=3, label='IForest score of series 2')
plt.axvline(200, c='k', linestyle=(0, (5, 10)), linewidth=0.5)
plt.legend(loc='lower left')
Upvotes: 0
Views: 870
Reputation: 6259
I believe this issue is caused by your anomaly samples lying outside the data distribution of the training data. The trees only ever split within the range of the training values, so every point beyond that range follows the same root-to-leaf path in each tree and ends up with the same "maximally anomalous" score. That is why the two series get identical scores there, and why the scores hit a floor.
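You can check this directly. Here is a minimal sketch that reuses the simulation from the question and probes the fitted forest with points at increasing distance beyond the training range (the specific probe values are just my illustration):
import numpy as np
from sklearn.ensemble import IsolationForest

### refit on the same training data as series 1 in the question
np.random.seed(0)
X1 = np.concatenate((np.random.normal(loc=2.75, scale=0.1, size=335),
                     np.random.normal(loc=3.2, scale=0.1, size=100)))
iso1 = IsolationForest(random_state=0).fit(X1[:200].reshape(-1, 1))

### probe points at increasing distance beyond the training range
### (the training values all fall roughly within [2.45, 3.05])
probe = np.array([2.75, 3.0, 3.3, 4.0, 10.0]).reshape(-1, 1)
print(iso1.score_samples(probe))
### 3.3, 4.0 and 10.0 receive the exact same (floor) score: every point
### beyond the largest training value follows the same root-to-leaf path
### in each tree, so its path length, and hence its score, is fixed.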
In general, this kind of data (a continuous, regular, univariate, normally distributed time series) is not something for which IsolationForest is a very good fit. It does better with many variables, sparse data, mixed categorical data, and non-typical distributions.
Other methods, like a z-score or a Median Absolute Deviation (MAD) transformation, will be more continuous in this case, with scores that keep increasing as data points move further away. And their scores can be interpreted probabilistically (e.g. via the normal CDF).
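For example, here is a minimal sketch of that idea (the mad_score helper is my own, and mapping the robust z-score through the normal CDF is one possible choice, not the only one):
import numpy as np
from scipy.stats import norm

def mad_score(train, values):
    """Two-sided anomaly score in (0, 1) based on a robust (MAD) z-score."""
    med = np.median(train)
    ### 1.4826 makes the MAD a consistent estimator of the standard
    ### deviation under a normal distribution
    mad = 1.4826 * np.median(np.abs(train - med))
    z = np.abs(values - med) / mad
    ### 2*Phi(|z|) - 1: the probability mass within |z| of the median;
    ### ~0 near the median, approaching 1 far away
    return 1 - 2 * norm.sf(z)

### applied to series 1 from the question
np.random.seed(0)
X1 = np.concatenate((np.random.normal(loc=2.75, scale=0.1, size=335),
                     np.random.normal(loc=3.2, scale=0.1, size=100)))
print(mad_score(X1[:200], X1[-5:]))  ### shifted points score close to 1
Unlike the IsolationForest score, this keeps increasing the further a point moves from the training data, and it naturally lives in (0, 1).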
Upvotes: 1