user1274878

Reputation: 1405

sklearn: Anomaly detection using Isolation Forests

I have a training dataset which contains no outliers:

train_vectors.shape
(588649, 896)

And, I have another set of test vectors (test_vectors), and all of them are outliers.

Here is my attempt at doing the outlier detection:

import numpy as np
from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples=0.01)
clf.fit(train_vectors)
y_pred_train = clf.predict(train_vectors)
print(len(y_pred_train))
print(np.count_nonzero(y_pred_train == 1))
print(np.count_nonzero(y_pred_train == -1))

Output:
 588649
 529771
 58878

So, here the outlier percentage is around 10%, which is the default contamination parameter used for Isolation Forests in sklearn. Please note that there aren't any outliers in the training set.

Testing code and results:

y_pred_test = clf.predict(test_vectors)
print(len(y_pred_test))
print(np.count_nonzero(y_pred_test == 1))
print(np.count_nonzero(y_pred_test == -1))

Output:
 100
 83
 17

So, it detects only 17 anomalies out of the 100. Can someone please tell me how to improve the performance? I am also not sure why the algorithm requires the user to specify the contamination parameter. It is clear to me that it is used as a threshold, but how am I supposed to know the contamination level beforehand? Thank you!

Upvotes: 0

Views: 3788

Answers (2)

tegraze

Reputation: 25

Although this question is a couple of years old, I'm posting this for future reference and for people asking similar questions, as I'm currently in a similar situation.

In the Scikit Learn Documentation it states:

Outlier detection: The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.

Novelty detection: The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.

This part of the question: "(..)here the outlier percentage is around 10% which is the default contamination parameter used for Isolation Forests in sklearn. Please note that there aren't any outliers in the training set." suggests that what you may actually want to use is Novelty Detection instead.

As @mkaran suggested, OneClassSVM can be used for novelty detection; however, since it is somewhat slow, I would suggest anyone in this situation to try Local Outlier Factor (with novelty=True) instead. Also, since sklearn version 0.22, the contamination parameter of IsolationForest defaults to 'auto', so you no longer need to supply a contamination level yourself, which may be very useful.
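A minimal sketch of the novelty-detection setup described above, using LocalOutlerFactor's novelty mode on synthetic data (the data here is made up for illustration, not the asker's vectors; the key assumption is that the training set is clean):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
train_vectors = rng.normal(0.0, 1.0, size=(1000, 8))  # clean, inliers only
test_vectors = rng.normal(6.0, 1.0, size=(100, 8))    # all far from training data

# novelty=True switches LOF from outlier detection to novelty detection,
# which makes predict() available for unseen data. Fit on clean data only.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(train_vectors)

y_pred = lof.predict(test_vectors)  # +1 = inlier, -1 = outlier/novelty
print(np.count_nonzero(y_pred == -1))
```

Because the model is fit on uncontaminated data, it learns the region where the inliers live and flags anything far from it, which matches the asker's setup much better than thresholding at a fixed contamination rate.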

Upvotes: 0

mkaran

Reputation: 2718

IsolationForest works a bit differently than what you described :). The contamination parameter is:

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. link

This means that your training set should contain about 10% outliers. Ideally, your test set should contain roughly the same proportion of outliers as well, and it should not consist of outliers only.

train set and test set proportions
------------------------------------------------
|  normal ~ 90%                  | outliers 10%|
------------------------------------------------

Try changing your dataset proportions as described and run the code you posted again!

Hope this helps, good luck!

P.S. You can also try OneClassSVM, which is trained on the normal instances only; the test set should still look much like the above, though, and not contain only outliers.
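A rough sketch of the OneClassSVM route, again on made-up data (nu and the data shapes are illustrative assumptions, not values from the question):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
train = rng.normal(0.0, 1.0, size=(500, 4))  # normal instances only
test = rng.normal(5.0, 1.0, size=(50, 4))    # unseen anomalies

# nu is an upper bound on the fraction of training errors,
# i.e. roughly how many training points may be treated as outliers
oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
oc.fit(train)

pred = oc.predict(test)  # +1 = inlier, -1 = anomaly
print(np.count_nonzero(pred == -1))
```

Note that OneClassSVM scales poorly with the number of samples, so on ~590k training vectors it may be impractically slow compared to IsolationForest or Local Outlier Factor.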

Upvotes: 1
