onofricamila

Reputation: 1089

Stream normalization for online clustering in evolving environments

TL;DR: how do you normalize stream data when the whole data set is not available up front and you are clustering in an evolving environment?

Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data so that all features have the same impact on the final clustering, but I don't know how to do it, since the whole data set is never available.

I need to apply standard (z-score) normalization. My initial approach (sketched in code after the list) was to:

  1. Fill a buffer with initial data points
  2. Use those data points to get mean and standard deviation
  3. Use those measures to normalize the current data points
  4. Send those normalized points to the algorithm one by one
  5. Use the previous measures to keep normalizing incoming data points for a while
  6. Every so often, recompute the mean and standard deviation
  7. Re-express the current micro-cluster centroids in terms of the new measures (having the older measures, it shouldn't be a problem to undo the old normalization and apply the new one)
  8. Use the new measures to keep normalizing incoming data points for a while
  9. And so on ....
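In code, the scheme above looks roughly like this (a Python sketch; the class name, buffer size, and refresh interval are my own illustrative choices, not from any library):

    import numpy as np

    class StreamingNormalizer:
        """Buffer-based z-score normalization for a data stream (illustrative sketch)."""

        def __init__(self, buffer_size=500, refresh_every=5000):
            self.buffer_size = buffer_size      # points used for the initial estimate
            self.refresh_every = refresh_every  # re-estimate mean/std this often
            self.buffer = []                    # raw points seen since the last estimate
            self.mean = None
            self.std = None

        def _refit(self):
            data = np.asarray(self.buffer)
            self.mean = data.mean(axis=0)
            self.std = data.std(axis=0) + 1e-12   # guard against zero variance
            self.buffer = []

        def process(self, x):
            """Feed one raw point; returns the normalized points ready to emit."""
            self.buffer.append(np.asarray(x, dtype=float))
            if self.mean is None:
                if len(self.buffer) < self.buffer_size:
                    return []                     # step 1: still filling the buffer
                pending = list(self.buffer)
                self._refit()                     # step 2: first mean/std estimate
                return [(p - self.mean) / self.std for p in pending]   # step 3
            out = [(self.buffer[-1] - self.mean) / self.std]           # steps 4-5, 8
            if len(self.buffer) >= self.refresh_every:
                self._refit()                     # step 6: periodic re-estimation
            return out

    def renormalize(center, old_mean, old_std, new_mean, new_std):
        """Step 7: map a centroid from the old normalized space to the new one."""
        raw = center * old_std + old_mean         # undo the old normalization
        return (raw - new_mean) / new_std         # apply the new one

The step-7 part is exactly the piece I can't cleanly apply, since the existing algorithms own their micro-cluster centroids.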

The thing is, normalization should not get entangled with what the clustering algorithm does. You can't tell the clustering algorithm 'OK, the micro-clusters you have so far need to be re-normalized with this new mean and stdev'. I developed an algorithm of my own where I could do this, but I am also using existing algorithms (CluStream and DenStream), and it doesn't feel right to modify them for this.

Any ideas?

TIA

Upvotes: 2

Views: 456

Answers (2)

serxio

Reputation: 11

Data normalization affects the result of any clustering algorithm that depends on the L2 distance. Therefore, you can't really have a global solution to your question.

If your clustering algorithm supports it, one option would be to use clustering with a warm-start in the following way:

  • at each step, find the "evolved" clusters from scratch, using the samples re-normalized according to the new mean and std dev
  • do not initialize the clusters randomly; instead, use the clusters found in the previous step, represented in the new space (see the sketch below).
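For a KMeans-style algorithm, a minimal sketch of that warm start with scikit-learn (the function name and the idea of keeping centers in raw units are my own illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    def warm_start_step(raw_samples, prev_centers_raw, k):
        """One re-clustering step: re-normalize the samples with the new mean/std
        and seed k-means with the previous centers mapped into the new space."""
        mean = raw_samples.mean(axis=0)
        std = raw_samples.std(axis=0) + 1e-12
        X = (raw_samples - mean) / std                 # samples in the new space

        if prev_centers_raw is None:                   # very first step
            km = KMeans(n_clusters=k, n_init=10).fit(X)
        else:
            init = (prev_centers_raw - mean) / std     # previous clusters, re-normalized
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)

        # keep the centers in raw units so they can be re-mapped at the next step
        centers_raw = km.cluster_centers_ * std + mean
        return km.labels_, centers_raw

Keeping the centers in raw units sidesteps the "which mean/std were these normalized with?" bookkeeping between steps.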

Upvotes: 1

Reveille

Reputation: 4619

As more data streams in, the estimated standardization parameters (e.g., mean and std) are updated and converge further toward the true values [1, 2, 3]. In evolving environments, this is even more pronounced, as the data distributions are time-varying too [4]. Therefore, the more recent streamed samples, which have been standardized using the more recent parameter estimates, are more accurate and representative.

A solution is to merge the present with a partial reflection of the past by embedding a new decay parameter in the update rule of your clustering algorithm. It boosts the contribution of the more recent samples, which have been standardized using the more recent distribution estimates. You can see an implementation of this idea in Apache Spark's MLlib [5, 6, 7]:

    c_{t+1} = (c_t · n_t · α + x_t · m_t) / (n_t · α + m_t)
    n_{t+1} = n_t + m_t

where α is the new decay parameter, c_t is the previous center of a cluster, n_t is the number of points assigned to it so far, x_t is the center of the points added in the current batch, and m_t is their count; a lower α makes the algorithm favor the more recent samples more.
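As an illustration only (not Spark code; names follow the formula above), this decayed update can be written in a few lines of Python:

    import numpy as np

    def update_center(c_t, n_t, batch, alpha):
        """Decayed streaming k-means update for one cluster:
        c_{t+1} = (c_t*n_t*alpha + x_t*m_t) / (n_t*alpha + m_t)."""
        batch = np.asarray(batch, dtype=float)
        m_t = len(batch)                   # points assigned to this cluster now
        x_t = batch.mean(axis=0)           # their center
        c_next = (c_t * n_t * alpha + x_t * m_t) / (n_t * alpha + m_t)
        n_next = n_t + m_t                 # updated point count
        return c_next, n_next

With α = 1 all past points contribute equally; with α = 0 only the newest batch is used.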

Upvotes: 1
