Reputation: 141
I'm trying to detect outliers in a dataframe using the Isolation Forest algorithm from sklearn.
Here's the code I'm using to set up the algorithm:
from sklearn.ensemble import IsolationForest

# behaviour='new' is the default since sklearn 0.22 and the parameter was removed in 0.24
iForest = IsolationForest(n_estimators=100, max_samples=256, contamination='auto', random_state=1)
iForest.fit(dataset)
scores = iForest.decision_function(dataset)
Now, since I don't know what a good value for the contamination could be, I would like to check my scores and decide where to draw the line based on the distribution of the scores. Here's the code for the graph and the graph itself:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50)
Is it correct to assume that negative scores indicate outliers in my dataframe? I can't find a good explanation of the range of the IF scores and how they work (why do I get negative scores at all?).
Additionally, is there a way to attach these scores to the original dataset so I can manually check the rows with negative scores and see whether they make sense?
Thanks!
Upvotes: 3
Views: 6039
Reputation: 4264
One way of approaching this problem is to make use of the score_samples method available in sklearn's IsolationForest. Once you have fitted the model to your data, call score_samples to get the abnormality score for each sample (the lower the value, the more abnormal the sample is). Since you don't have ground-truth information about the anomalies in your data, you can sort your samples by these scores and manually review the records, starting with the lowest-scored ones, to check whether they are actually anomalies. In the process you can settle on a threshold score for classifying a data point as an anomaly, which you can then reuse for any new data.
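A minimal sketch of this review workflow (the data, column names, and threshold below are illustrative assumptions, not from the question):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical data: mostly Gaussian points plus one planted outlier.
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])
X.iloc[0] = [8.0, 8.0]  # obvious anomaly for demonstration

iforest = IsolationForest(n_estimators=100, random_state=1).fit(X)

# Lower score_samples value = more abnormal.
X["score"] = iforest.score_samples(X[["f1", "f2"]])

# Review the most abnormal rows first (lowest scores).
suspects = X.sort_values("score").head(10)
print(suspects)

# After manual review, pick a threshold and reuse it on new data.
threshold = -0.6  # illustrative; choose it from your own review
flagged = X[X["score"] < threshold]
```

Sorting ascending puts the strongest anomaly candidates at the top, so the manual review effort concentrates where it matters.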
Upvotes: 5
Reputation: 949
The return value of score_samples is $-s(x,\psi)$, whose range is $[-1,0]$; a value near $-1$ means the average decision-path length is short, so the point is abnormal.
decision_function converts score_samples to roughly $[-0.5,0.5]$ by subtracting offset_.
predict then converts decision_function to $-1$ or $1$ according to the predefined anomaly rate (contamination).
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(1).normal(size=(100, 2))  # example data

iforest = IsolationForest(n_estimators=100,
                          max_features=1.0,
                          max_samples='auto',
                          contamination='auto',
                          bootstrap=False,
                          n_jobs=1,
                          random_state=1)
iforest.fit(X)
scores = iforest.score_samples(X)
predict = iforest.predict(X)
decision = iforest.decision_function(X)
offset = iforest.offset_  # default -0.5 when contamination='auto'
print(offset)
print(iforest.max_samples_)
assert np.allclose(decision, scores - offset)
assert np.allclose(predict, np.where(decision < 0, -1, 1))
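To address the question of attaching these scores to the original dataset, here is a small sketch assuming the data lives in a pandas DataFrame (the data and column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical dataframe standing in for the question's `dataset`.
rng = np.random.RandomState(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

iforest = IsolationForest(n_estimators=100, random_state=1).fit(df)

# Attach all three score types as new columns for manual inspection.
out = df.assign(
    score=iforest.score_samples(df),
    decision=iforest.decision_function(df),
    label=iforest.predict(df),
)

# Rows predicted as anomalies are exactly those with decision < 0.
anomalies = out[out["label"] == -1].sort_values("score")
print(anomalies.head())
```

This makes it easy to eyeball the flagged rows next to their raw feature values and decide whether the model's threshold is sensible.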
Upvotes: 3