Reputation: 3182
I have a data pipeline system where all events are stored in Apache Kafka. There is an event processing layer, which consumes and transforms that data (time series) and then stores the resulting data set in Apache Cassandra.
Now I want to use Apache Spark to train some machine learning models for anomaly detection. The idea is to run the k-means algorithm on the historical data, for example separately for every hour of the day.
For example, I can select all events from 4pm-5pm and build a model for that interval. With this approach I get exactly 24 models (a set of centroids for every hour).
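Roughly what I have in mind, as a sketch (reading the processed data back from Cassandra with the DataStax Spark connector; the keyspace, table, and column names like `event_hour` and `value` are just placeholders for my schema, and `k=5` is an arbitrary choice):

```scala
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hourly-kmeans").getOrCreate()

// Processed time series read back from Cassandra via the DataStax connector
// (keyspace, table, and column names are placeholders for my schema)
val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pipeline", "table" -> "processed_events"))
  .load()

// Turn the numeric value into the ML feature vector that KMeans expects
val assembler = new VectorAssembler()
  .setInputCols(Array("value"))
  .setOutputCol("features")

// One k-means model per hour of the day -> 24 models (sets of centroids)
val modelsByHour: Map[Int, KMeansModel] = (0 until 24).map { hour =>
  val slice = assembler.transform(events.filter(col("event_hour") === hour))
  hour -> new KMeans().setK(5).setSeed(42L).fit(slice)
}.toMap
```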
If the algorithm performs well, I can reduce the size of my interval to be for example 5 minutes.
Is it a good approach to do anomaly detection on time series data?
Upvotes: 1
Views: 2024
Reputation: 7732
I have to say that strategy is good for finding outliers, but you need to take care of a few things. First, using all events of every 5-minute window to create a new centroid per interval is probably not a good idea: with too many centroids, almost every point ends up close to one of them, which makes it really hard to find the outliers, and that is exactly what you don't want.
So let's look at a better strategy: yes, your idea is good and it works, but make sure you don't overload the cluster. You can also keep feeding each day's data back in to train your models further, but run the centroid computation only once a day, and then let the Euclidean distance to the nearest centroid decide whether a point belongs to one of your groups or not.
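For example, something like this sketch for the scoring side (assuming a fitted `KMeansModel` and a `features` column like in your training code; the `threshold` is something you have to tune yourself, e.g. from a high percentile of the training distances):

```scala
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Flag points whose Euclidean distance to the nearest centroid of the
// daily model exceeds a threshold tuned on your own data
def flagOutliers(model: KMeansModel, batch: DataFrame, threshold: Double): DataFrame = {
  val centers = model.clusterCenters

  // distance to the closest of the k centroids
  val distToNearest = udf { (features: Vector) =>
    centers.map(c => math.sqrt(Vectors.sqdist(c, features))).min
  }

  batch
    .withColumn("distance", distToNearest(col("features")))
    .withColumn("is_outlier", col("distance") > threshold)
}
```

Because the centroids are computed only once a day, this scoring step stays cheap enough to run on every new batch of events.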
I hope this helps!
Upvotes: 1