A colleague and I are trying to detect anomalies in a large dataset. We want to try out different algorithms (LOF, OC-SVM, DBSCAN, etc.), but we are currently working with IsolationForest.
Our dataset is currently shaped as follows. It is a count of the number of event types logged per user per day, and contains more than 300,000 records:
date | user | event | count |
---|---|---|---|
6/1/2021 | user_a | Open | 2 |
6/2/2021 | user_a | Open | 4 |
6/1/2021 | user_b | Modify | 3 |
6/2/2021 | user_b | Open | 5 |
6/2/2021 | user_b | Delete | 2 |
6/3/2021 | user_b | Open | 7 |
6/5/2021 | user_b | Move | 4 |
6/4/2021 | user_c | Modify | 3 |
6/4/2021 | user_c | Move | 6 |
Our goal is to automatically detect anomalous counts of events per user. For example, for a user who normally logs between 5 and 10 "Open" events per day, a count of 400 would be an outlier. My colleague and I are having a discussion on how we should prepare the dataset for the IsolationForest algorithm.
One of us says we should drop the date field and label-encode the rest of the data, i.e. encode all strings as integers, and let IsolationForest calculate an outlier score for each of the records.
The other is of the opinion that label encoding should NOT be done, since categorical data cannot simply be replaced by integers. Instead, the data should be scaled, the user column should be dropped (or set as index), and the event column should be pivoted to generate more dimensions (the example below shows what he wants to do):
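To make the first proposal concrete, here is a minimal sketch of it (the toy DataFrame and variable names are my own, not our actual pipeline):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest

# Toy version of our dataset (counts of event types per user per day).
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/1/2021", "6/2/2021"],
    "user":  ["user_a", "user_a", "user_b", "user_b"],
    "event": ["Open", "Open", "Modify", "Open"],
    "count": [2, 4, 3, 5],
})

# Drop the date, encode the remaining string columns as integers.
X = df.drop(columns=["date"]).copy()
for col in ["user", "event"]:
    X[col] = LabelEncoder().fit_transform(X[col])

# Fit IsolationForest and score each record (lower = more anomalous).
clf = IsolationForest(random_state=0).fit(X)
df["score"] = clf.decision_function(X)
```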
date | user | event_Open | event_Modify | event_Delete | event_Move |
---|---|---|---|---|---|
6/1/2021 | user_a | 2 | NaN | NaN | NaN |
6/2/2021 | user_a | 4 | NaN | NaN | NaN |
6/1/2021 | user_b | NaN | 3 | NaN | NaN |
6/2/2021 | user_b | 5 | NaN | 2 | NaN |
6/3/2021 | user_b | 7 | NaN | NaN | NaN |
6/5/2021 | user_b | NaN | NaN | NaN | 4 |
6/4/2021 | user_c | NaN | 3 | NaN | 6 |
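The pivot he has in mind could be done roughly like this (again a toy DataFrame; `pivot_table` with the default mean aggregation, since each date/user/event combination occurs once):

```python
import pandas as pd

# Same toy dataset as above.
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/2/2021"],
    "user":  ["user_a", "user_b", "user_b"],
    "event": ["Open", "Open", "Delete"],
    "count": [2, 5, 2],
})

# One row per date/user, one column per event type (NaN where absent).
wide = (df.pivot_table(index=["date", "user"], columns="event", values="count")
          .add_prefix("event_")
          .reset_index())
```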
So we're in disagreement on a couple of points. I'll list them below and include my thoughts on them:
Issue | Comment |
---|---|
Label encoding | Is a must and does not affect the categorical nature of the dataset |
Scaling | IsolationForest is by nature insensitive to scaling, making scaling superfluous |
Drop date column | The date is not actually a feature in the dataset, as the date has no correlation to the anomalousness of the count per event type per user |
Drop user column | User is actually a (critical) feature and should not be dropped |
Pivot event column | This generates a sparse matrix, which can be bad practice. It also introduces relations within the data that do not exist in reality (for example, on 2 June user_b logged 5 Open events and 2 Delete events, but these are unrelated and should therefore not form a single record) |
I am very curious to hear your thoughts on these points. What is best practice regarding the issues listed above when using the IsolationForest algorithm for anomaly detection?