A colleague and I are trying to detect anomalies in a large dataset. We want to try out different algorithms (LOF, OC-SVM, DBSCAN, etc.), but we are currently working with IsolationForest.
Our dataset is currently shaped as follows. It is a count of the number of event types logged per user per day, and contains more than 300,000 records:
date | user | event | count |
---|---|---|---|
6/1/2021 | user_a | Open | 2 |
6/2/2021 | user_a | Open | 4 |
6/1/2021 | user_b | Modify | 3 |
6/2/2021 | user_b | Open | 5 |
6/2/2021 | user_b | Delete | 2 |
6/3/2021 | user_b | Open | 7 |
6/5/2021 | user_b | Move | 4 |
6/4/2021 | user_c | Modify | 3 |
6/4/2021 | user_c | Move | 6 |
Our goal is to automatically detect anomalous counts of events per user. For example, for a user who normally logs between 5 and 10 "Open" events per day, a count of 400 would be an outlier. My colleague and I are having a discussion on how we should prepare the dataset for the IsolationForest algorithm.
One of us says we should drop the date field and label-encode the rest of the data, i.e. encode all strings as integers, and let IsolationForest calculate an outlier score for each of the records.
The other is of the opinion that label encoding should NOT be done, since categorical data cannot simply be replaced by integers. Instead, the data should be scaled, the user column should be dropped (or set as index), and the event column should be pivoted to generate more dimensions (the example below shows what he wants to do):
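To make the first proposal concrete, here is a minimal sketch of it (the toy DataFrame and variable names are my own, not our actual pipeline):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest

# Toy version of our dataset (counts of event types per user per day).
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/1/2021", "6/2/2021"],
    "user":  ["user_a", "user_a", "user_b", "user_b"],
    "event": ["Open", "Open", "Modify", "Open"],
    "count": [2, 4, 3, 5],
})

# Drop the date, encode the remaining string columns as integers.
X = df.drop(columns=["date"]).copy()
for col in ["user", "event"]:
    X[col] = LabelEncoder().fit_transform(X[col])

# Fit IsolationForest and score each record (lower = more anomalous).
clf = IsolationForest(random_state=0).fit(X)
df["score"] = clf.decision_function(X)
```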
date | user | event_Open | event_Modify | event_Delete | event_Move |
---|---|---|---|---|---|
6/1/2021 | user_a | 2 | NaN | NaN | NaN |
6/2/2021 | user_a | 4 | NaN | NaN | NaN |
6/1/2021 | user_b | NaN | 3 | NaN | NaN |
6/2/2021 | user_b | 5 | NaN | 2 | NaN |
6/3/2021 | user_b | 7 | NaN | NaN | NaN |
6/5/2021 | user_b | NaN | NaN | NaN | 4 |
6/4/2021 | user_c | NaN | 3 | NaN | 6 |
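The pivot he has in mind could be done roughly like this (again a toy DataFrame; `pivot_table` with the default mean aggregation, since each date/user/event combination occurs once):

```python
import pandas as pd

# Same toy dataset as above.
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/2/2021"],
    "user":  ["user_a", "user_b", "user_b"],
    "event": ["Open", "Open", "Delete"],
    "count": [2, 5, 2],
})

# One row per date/user, one column per event type (NaN where absent).
wide = (df.pivot_table(index=["date", "user"], columns="event", values="count")
          .add_prefix("event_")
          .reset_index())
```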
So we're in disagreement on a couple of points. I'll list them below and include my thoughts on them:
Issue | Comment |
---|---|
Label encoding | Is a must and does not affect the categorical nature of the dataset |
Scaling | IsolationForest is by nature insensitive to scaling, making scaling superfluous |
Drop date column | The date is not actually a feature in the dataset, as the date has no correlation to the anomalousness of the count per event type per user |
Drop user column | User is actually a (critical) feature and should not be dropped |
Pivot event column | This generates a sparse matrix, which can be bad practice. It also introduces relations within the data that do not exist in reality (for example, on 2 June user_b logged 5 Open events and 2 Delete events, but these are unrelated and should therefore not form a single record) |
I am very curious to hear your thoughts on these points. What is best practice regarding the issues listed above when using the IsolationForest algorithm for anomaly detection?