carguerriero

Reputation: 141

How to correctly identify anomalies using Isolation Forest and resulting scores?

I'm trying to detect outliers in a dataframe using the Isolation Forest algorithm from sklearn.

Here's the code I'm using to set up the algorithm:

from sklearn.ensemble import IsolationForest

iForest = IsolationForest(n_estimators=100, max_samples=256, contamination='auto', random_state=1)
iForest.fit(dataset)
scores = iForest.decision_function(dataset)

Now, since I don't know what a good value for the contamination could be, I would like to check my scores and decide where to draw the line based on the distribution of the scores. Here's the code for the graph and the graph itself:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50)
plt.show()

[Histogram of the decision_function scores]

Is it correct to assume that negative scores indicate outliers in my dataframe? I can't find a good explanation of the range of the Isolation Forest scores or how they work (why do I get negative scores?).

Additionally, is there a way to attach these scores to the original dataset and manually check rows with negative scores to see if they make sense?

Thanks!

Upvotes: 3

Views: 6039

Answers (2)

Parthasarathy Subburaj

Reputation: 4264

One way of approaching this problem is to use the score_samples method available on sklearn's IsolationForest. Once you have fitted the model to your data, call score_samples to get an abnormality score for each sample (the lower the value, the more abnormal it is). Since you don't have ground-truth labels for the anomalies, you can sort your samples by these scores and manually review the records, starting with the lowest-scored one, to check whether they really are anomalies. In the process you can settle on a threshold score for classifying a data point as an anomaly, which you can then apply to any new data to decide whether it is anomalous or not.
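As a concrete sketch of the attach-and-review step (assuming `dataset` is a pandas DataFrame, as in the question; the toy data here is just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy data standing in for the question's `dataset`,
# with one obvious outlier planted at row 0.
rng = np.random.RandomState(1)
dataset = pd.DataFrame(rng.normal(size=(500, 2)), columns=["a", "b"])
dataset.iloc[0] = [8.0, 8.0]

iforest = IsolationForest(n_estimators=100, random_state=1).fit(dataset)

# Attach the scores as a column; lower score = more abnormal.
scored = dataset.assign(score=iforest.score_samples(dataset))

# The lowest-scored rows are the ones to review manually.
suspects = scored.sort_values("score").head(10)
print(suspects)
```

Sorting ascending puts the most suspicious rows first, so you can eyeball them (and pick a cut-off score) without having to guess a contamination value up front.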

Upvotes: 5

Joey Gao

Reputation: 949

The return value of score_samples is $-s(x,\psi)$, where $s$ is the anomaly score from the original paper, so its range is $[-1, 0]$. Values close to $-1$ mean the average path length is short, i.e. the point is abnormal.

decision_function shifts score_samples by offset_ into $[-0.5, 0.5]$, and predict converts decision_function into $-1$ or $1$ according to the predefined anomaly rate (contamination).

import numpy as np
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100,
                          max_features=1.0,
                          max_samples='auto',
                          contamination='auto',
                          bootstrap=False,
                          n_jobs=1,
                          random_state=1)

iforest.fit(X)  # X: your feature matrix

scores = iforest.score_samples(X)        # in [-1, 0], lower = more abnormal
predict = iforest.predict(X)             # -1 for anomalies, 1 for normal points
decision = iforest.decision_function(X)
offset = iforest.offset_                 # -0.5 when contamination='auto'

print(offset)
print(iforest.max_samples_)
assert np.allclose(decision, scores - offset)
assert np.allclose(predict, np.where(decision < 0, -1, 1))

Upvotes: 3
