Reputation: 1089
Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data so that all features have the same impact on the final clustering, but I don't know how to do it.
I need to apply a standard normalization. My initial approach was to:
The thing is that normalization should not interfere with what the clustering algorithm does. You can't simply tell the clustering algorithm, 'OK, the micro-clusters you have so far need to be re-normalized with this new mean and stdev.' I could do this in an algorithm I developed myself, but I am also using existing algorithms (CluStream and DenStream), and it does not feel right to modify them to support this.
Any ideas?
TIA
Upvotes: 2
Views: 456
Reputation: 11
Data normalization changes the result of any clustering algorithm that depends on the L2 (Euclidean) distance. Therefore there isn't really a single solution to your question that works for every algorithm.
If your clustering algorithm supports it, one option would be to use clustering with a warm-start in the following way:
Upvotes: 1
Reputation: 4619
As more data streams in, the estimated standardization parameters (e.g., mean and standard deviation) are updated and converge toward the true values [1, 2, 3]. In evolving environments this matters even more, because the data distribution itself is time-varying [4]. Therefore, the more recent samples, which were standardized using the more recent parameter estimates, are more accurate and representative.
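To make the "parameters are updated as data streams in" part concrete, here is a minimal sketch of an online standardizer. It uses Welford's running mean/variance update, which is my choice for the illustration and not something the question or the cited references prescribe; the class name `RunningStandardizer` and its methods are hypothetical. It sits in front of the clustering algorithm, so CluStream or DenStream would not need to be modified.

```python
import math


class RunningStandardizer:
    """Per-feature running mean and variance (Welford's algorithm).

    Hypothetical helper for illustration only.
    """

    def __init__(self, n_features):
        self.n = 0                        # number of samples seen so far
        self.mean = [0.0] * n_features    # running mean per feature
        self.m2 = [0.0] * n_features      # running sum of squared deviations

    def update(self, x):
        """Fold one sample (a sequence of floats) into the estimates."""
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def transform(self, x):
        """Standardize one sample with the current estimates."""
        if self.n < 2:
            # Not enough data to estimate a variance yet.
            return [0.0] * len(x)
        out = []
        for i, xi in enumerate(x):
            std = math.sqrt(self.m2[i] / (self.n - 1))  # sample std
            out.append((xi - self.mean[i]) / std if std > 0 else 0.0)
        return out


scaler = RunningStandardizer(n_features=2)
for sample in ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0]):
    scaler.update(sample)
    z = scaler.transform(sample)  # pass z to the clustering algorithm
```

Note that early samples are standardized with rough estimates, which is exactly why the more recent standardized samples deserve more weight.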
A solution is to blend the present with a partial reflection of the past by adding a decay parameter to the update rule of your clustering algorithm. This boosts the contribution of the more recent samples, which were standardized using the more recent distribution estimates. You can see an implementation of this idea in Apache Spark's MLlib [5, 6, 7]:
where α is the decay parameter; a lower α makes the algorithm weight the more recent samples more heavily.
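As a rough sketch of such a decayed update for a single cluster center: the function below follows the form of the rule described in the Spark streaming k-means documentation, c_new = (c * n * α + x * m) / (n * α + m), where c is the old center, n its accumulated weight, x the mean of the points assigned in the current batch, and m their count. The function name and the decision to carry the discounted weight forward are assumptions of this example, not Spark's API.

```python
def decayed_center_update(center, weight, batch_mean, batch_count, alpha):
    """Blend an old cluster center with the mean of a new batch.

    alpha = 1 keeps the full history; alpha = 0 forgets everything
    before the current batch. Hypothetical helper for illustration.
    """
    discounted = weight * alpha          # shrink the influence of the past
    total = discounted + batch_count
    if total == 0:
        # Nothing old and nothing new: leave the center unchanged.
        return list(center), 0.0
    new_center = [
        (c * discounted + x * batch_count) / total
        for c, x in zip(center, batch_mean)
    ]
    return new_center, total


# Old center at the origin (weight 10); 10 new points with mean (1, 1).
keep_all, _ = decayed_center_update([0.0, 0.0], 10, [1.0, 1.0], 10, alpha=1.0)
forget, _ = decayed_center_update([0.0, 0.0], 10, [1.0, 1.0], 10, alpha=0.0)
# keep_all is [0.5, 0.5]; forget is [1.0, 1.0]
```

With α between 0 and 1 the center moves somewhere between those two extremes, so micro-clusters built from poorly standardized early samples fade out gradually instead of having to be re-normalized.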
Upvotes: 1