timDS

Reputation: 11

IsolationForest, transforming data

A colleague and I are trying to detect anomalies in a large dataset. We want to try out different algorithms (LOF, OC-SVM, DBSCAN, etc.), but we are currently working with IsolationForest.

Our dataset is currently shaped as follows. It is a count of the number of event types logged per user per day and contains more than 300,000 records:

date      user    event   count
6/1/2021  user_a  Open    2
6/2/2021  user_a  Open    4
6/1/2021  user_b  Modify  3
6/2/2021  user_b  Open    5
6/2/2021  user_b  Delete  2
6/3/2021  user_b  Open    7
6/5/2021  user_b  Move    4
6/4/2021  user_c  Modify  3
6/4/2021  user_c  Move    6

Our goal is to automatically detect anomalous counts of events per user. For example, for a user who normally logs between 5 and 10 "Open" events per day, a count of 400 would be an outlier. My colleague and I are having a discussion about how we should prepare the dataset for the IsolationForest algorithm.

One of us says we should drop the date field and label-encode the rest of the data, i.e. encode all strings as integers, and let IsolationForest calculate an outlier score for each record. A rough sketch of this approach is shown below.
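To make the first approach concrete, here is a minimal sketch, assuming the raw data is already in a pandas DataFrame named df with the columns from the example above (the encoder, hyperparameters, and column handling are illustrative choices, not settled decisions):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import IsolationForest

    # Approach 1: drop the date and encode the categorical columns as integers
    features = df.drop(columns=["date"]).copy()
    for col in ["user", "event"]:
        features[col] = LabelEncoder().fit_transform(features[col])

    iso = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
    iso.fit(features)

    # decision_function: the lower the score, the more anomalous the record
    # predict: -1 marks an outlier, 1 an inlier
    df["score"] = iso.decision_function(features)
    df["outlier"] = iso.predict(features)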

The other is of the opinion that label encoding should NOT be done, since categorical data cannot meaningfully be replaced by integers. Instead, the data should be scaled, the user column should be dropped (or set as the index), and the event column should be pivoted to generate more dimensions. The example below shows what he wants to do (a pandas sketch of this pivot follows the table):

date      user    event_Open  event_Modify  event_Delete  event_Move
6/1/2021  user_a  2           NaN           NaN           NaN
6/2/2021  user_a  4           NaN           NaN           NaN
6/1/2021  user_b  NaN         3             NaN           NaN
6/2/2021  user_b  5           NaN           2             NaN
6/3/2021  user_b  7           NaN           NaN           NaN
6/5/2021  user_b  NaN         NaN           NaN           4
6/4/2021  user_c  NaN         3             NaN           6
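A rough sketch of how that pivot could be built in pandas, again assuming a DataFrame df shaped like the first table (filling NaN with 0 and using RobustScaler are assumptions for illustration, not part of the original plan):

    import pandas as pd
    from sklearn.preprocessing import RobustScaler
    from sklearn.ensemble import IsolationForest

    # Approach 2: one column per event type, one row per (date, user) pair
    pivoted = df.pivot_table(index=["date", "user"],
                             columns="event",
                             values="count",
                             aggfunc="sum")
    pivoted.columns = [f"event_{c}" for c in pivoted.columns]

    # IsolationForest does not accept NaN, so missing combinations need a value
    pivoted = pivoted.fillna(0)

    # Optional scaling step from approach 2
    X = RobustScaler().fit_transform(pivoted)

    iso = IsolationForest(n_estimators=100, random_state=42)
    pivoted["score"] = iso.fit(X).decision_function(X)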

So we are in disagreement on a couple of points. I'll list them below and include my thoughts on them:

Issue               Comment
Label encoding      Is a must and does not affect the categorical nature of the dataset
Scaling             IsolationForest is by nature insensitive to feature scaling, which makes scaling superfluous
Drop date column    The date is not actually a feature in this dataset, as the date has no correlation with how anomalous the count per event type per user is
Drop user column    User is actually a (critical) feature and should not be dropped
Pivot event column  This generates a sparse matrix, which can be bad practice. It also introduces relations within the data that do not exist in reality (for example, user_b logged 5 Open events and 2 Delete events on June 2, but those events are not related and should therefore not form a single record)

I am very curious to hear your thoughts on these points. What is best practice regarding the issues listed above when using the IsolationForest algorithm for anomaly detection?

Upvotes: 1

Views: 691

Answers (0)
