Reputation: 3
I am working on an anomaly detection model and need help identifying anomalies in data transfer. Example: if an employee is connected over VPN and we have the following data usage:
EMPID  date       Bytes_sent  Bytes_recieved
A123   Timestamp  222222      3333333
A123   Timestamp  444444      6666666
A123   Timestamp  99999999    88888888888
I want to flag row 3 as abnormal, since the employee has been sending and receiving within a certain range and then there is a sudden jump. I also want to keep track of the bytes sent and received over recent days, i.e. how the employee's behavior has been changing over the last few days.
Upvotes: 0
Views: 156
Reputation: 663
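One way is to keep additional metrics for each observation.
For Bytes_recieved, these are a running average over the last N observations (br-avg-last-N), a running standard deviation over the same window (br-sd-last-N), and an outlier flag (br-Outlier).
N depends on how many observations you want to consider. You mentioned recent days, so you could set N = (number of "recent" days) * (average events per day).
E.g.: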
EMPID  date       Bytes_sent  Bytes_recieved  br-avg-last-N  br-sd-last-N  br-Outlier
A123   Timestamp  222222      3333333         3333333        2357022.368   FALSE
A123   Timestamp  444444      6666666         4999999.5      2356922.368   FALSE
A123   Timestamp  99999999    88888888888     N/A            N/A           TRUE
The Bytes_recieved outlier flag for row three is TRUE if the observed Bytes_recieved falls outside the range defined by the previous row's metrics:
(previous br-avg-last-N) - 2 * (previous br-sd-last-N) and (previous br-avg-last-N) + 2 * (previous br-sd-last-N)
4999999.5 + 2 * 2356922.368 = 9713844.236, and 9,713,844.236 < 88,888,888,888 -> TRUE
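If the data is roughly normally distributed, a band of 2 standard deviations around the average covers about 95% of observations, so the flag marks extreme observations you would expect to see only ~5% of the time. You can widen or narrow the band to suit your needs.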
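The rolling check described above could be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `flag_outliers` and the parameters `n`, `k` and `min_history` are made up for this example, the sample standard deviation is used, and it assumes that flagged values are left out of the running window (which is one way to read the N/A entries in row three).

```python
from collections import deque
from statistics import mean, stdev


def flag_outliers(values, n=10, k=2.0, min_history=2):
    """Flag values lying outside mean +/- k * SD of the last n accepted values.

    Hypothetical helper: the names and defaults are assumptions for this sketch.
    """
    history = deque(maxlen=n)  # last n observations that were not flagged
    flags = []
    for v in values:
        if len(history) < min_history:
            # Too little history to estimate a spread, so nothing is flagged yet.
            is_outlier = False
        else:
            avg = mean(history)
            sd = stdev(history)  # sample standard deviation of the window
            # Note: if sd is 0, any value different from avg is flagged.
            is_outlier = abs(v - avg) > k * sd
        flags.append(is_outlier)
        if not is_outlier:
            # Assumption: outliers do not update the running statistics.
            history.append(v)
    return flags


bytes_recieved = [3333333, 6666666, 88888888888]
print(flag_outliers(bytes_recieved))
```

For the three rows in the example, only the third value is flagged. Whether outliers should feed back into the window is a design choice: excluding them keeps one spike from inflating the SD, but it also means a genuine, lasting change in behavior keeps being flagged.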
You can either do the same for Bytes_sent and combine the two flags with an 'or' condition for the outlier decision, or calculate the distance of each observation from a multidimensional running average (here X is Bytes_sent and Y is Bytes_recieved) and mark outliers based on extreme distances. (For this you'll need to track a running SD, or another spread metric, per observation.)
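One possible reading of the distance idea is a standardized Euclidean distance: each dimension's deviation from its running average is divided by its running SD, and the results are combined. The sketch below assumes that reading. The helper name `scaled_distance` and the cut-off `THRESHOLD` are hypothetical, and the history needs at least two rows for the sample SD to be defined.

```python
from math import sqrt
from statistics import mean, stdev


def scaled_distance(history, point):
    """Distance of `point` from the mean of `history`, per-dimension SD-scaled.

    Hypothetical helper: `history` is a list of (Bytes_sent, Bytes_recieved)
    tuples and `point` is the new observation in the same layout.
    """
    total = 0.0
    for dim, x in enumerate(point):
        column = [row[dim] for row in history]
        sd = stdev(column)  # sample SD, needs at least two rows
        if sd == 0:
            # Assumption: a dimension with no observed spread is skipped.
            continue
        total += ((x - mean(column)) / sd) ** 2
    return sqrt(total)


history = [(222222, 3333333), (444444, 6666666)]
new_point = (99999999, 88888888888)

THRESHOLD = 3.0  # hypothetical cut-off, to be tuned on real data
d = scaled_distance(history, new_point)
print(d > THRESHOLD)
```

Scaling by the SD keeps the dimension with the larger raw numbers from dominating the distance. If Bytes_sent and Bytes_recieved are strongly correlated, a distance that accounts for covariance (such as the Mahalanobis distance) may be a better fit.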
This way you could also easily add dimensions, such as time-of-day anomalies or extreme differences between Bytes_sent and Bytes_recieved.
Upvotes: 1