Reputation: 23
I am working on a project to detect anomalies in web users' activity in real time. Any malicious activity by a user has to be detected as it happens. The input is clickstream data, where each click record contains a user ID (unique per user), the click URL (URL of the web page), the click text (the text or function on the website that the user clicked) and information (anything typed by the user). The project is similar to an intrusion detection system (IDS). I am using Python 3.6 and have the following question:
I am really confused about how to approach data preprocessing. Any insight or suggestions would be appreciated.
Upvotes: 0
Views: 260
Reputation: 1
In some recent personal and professional projects where I had to apply ML to streaming data, I have had success with the Python library River: https://github.com/online-ml/river.
Some online algorithms, such as Hoeffding trees, can handle categorical (labelled) values directly, so depending on what you want to achieve you may not need any preprocessing.
If you do need preprocessing, label encoding and one-hot encoding can be applied incrementally. Below is some code to get you started. River also has a number of classes to help with feature extraction and feature selection, e.g. TF-IDF, bag of words, or frequency aggregations.
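# click_stream: any iterable of dict-like click records (placeholder)
# click__feature_label_of_interest: key of the categorical field to encode (placeholder)
online_label_enc = {}  # maps each label seen so far to an integer code

for click in click_stream:
    label = click[click__feature_label_of_interest]
    try:
        label_enc = online_label_enc[label]
    except KeyError:
        # First occurrence of this label: assign the next unused integer code
        online_label_enc[label] = len(online_label_enc)
        label_enc = online_label_enc[label]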
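One-hot encoding can be done incrementally in much the same way. The following is only a minimal pure-Python sketch, not River's implementation: the `OnlineOneHotEncoder` class, the sample `click_stream` and the `click_text` field name are all made up for illustration, and the `learn_one` / `transform_one` method names are simply borrowed from River's naming style.

```python
class OnlineOneHotEncoder:
    """Incrementally one-hot encodes a single categorical feature (illustrative only)."""

    def __init__(self):
        # category value -> column index, in order of first appearance
        self.categories = {}

    def learn_one(self, value):
        # Register a category the first time it appears in the stream
        if value not in self.categories:
            self.categories[value] = len(self.categories)

    def transform_one(self, value):
        # One slot per known category; an unseen value gives an all-zero vector
        encoded = [0] * len(self.categories)
        if value in self.categories:
            encoded[self.categories[value]] = 1
        return encoded


# Hypothetical click records; field names are assumptions
click_stream = [
    {"user_id": "u1", "click_text": "login"},
    {"user_id": "u2", "click_text": "search"},
    {"user_id": "u1", "click_text": "login"},
]

encoder = OnlineOneHotEncoder()
encoded_clicks = []
for click in click_stream:
    encoder.learn_one(click["click_text"])
    encoded_clicks.append(encoder.transform_one(click["click_text"]))
```

Note that the encoded vector grows as new categories show up, so a downstream model has to cope with a changing number of features. Representing each record as a dict of feature name to value, as River does, is one way around that.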
Upvotes: 0