endorphinus

Reputation: 119

Sequential k-means

Can I use the cluster_centers_ coordinates from a previous KMeans fit as the init argument, so that the cluster centers are updated sequentially as new data arrives? Are there any drawbacks to this method?

UPDATED: Online version of scikit-learn's K-means:

import numpy as np
from sklearn.cluster import KMeans

# initial fit on the data available so far
KM = KMeans(n_clusters=3, random_state=200, n_init=1).fit(initial_data)
centroids = KM.cluster_centers_.copy()
counts = np.bincount(KM.labels_, minlength=3).astype(float)

# until interrupted: process each new sample x as it arrives
for x in data:
    i = np.argmin(np.linalg.norm(centroids - x, axis=1))   # index of the nearest centroid
    counts[i] += 1
    centroids[i] += (x - centroids[i]) / counts[i]          # running-mean update
    # re-seed the estimator with the updated centroids as its init
    KM = KMeans(n_clusters=3, random_state=200, init=centroids, n_init=1)

Upvotes: 1

Views: 1125

Answers (1)

Molessia

Reputation: 473

Yes, it is a possible solution. However, you could further improve your implementation by following this pseudo-code (for more information, take a look at this post: Online k-means clustering):

Make initial guesses for the means m1, m2, ..., mk
Set the counts n1, n2, ..., nk to zero
Until interrupted
    Acquire the next example, x
    If mi is closest to x
        Increment ni
        Replace mi by mi + (1/ni)*( x - mi)
    end_if
end_until

Following this version of the online algorithm, you only need to remember the mean of each cluster and the count of data points assigned to it. Once you update those two variables, you can forget the new data point.

Compared to yours, this solution does not need to keep the past data, so it is computationally more efficient.
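
A minimal NumPy sketch of this update rule, assuming initial_centroids holds the initial guesses for the k means and stream yields one sample at a time (both names are illustrative):

import numpy as np

def sequential_kmeans(initial_centroids, stream):
    # initial guesses for the means and zeroed counts
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    counts = np.zeros(len(centroids))

    for x in stream:                                           # acquire the next example x
        i = np.argmin(np.linalg.norm(centroids - x, axis=1))   # index of the closest mean m_i
        counts[i] += 1                                          # increment n_i
        centroids[i] += (x - centroids[i]) / counts[i]          # m_i <- m_i + (1/n_i) * (x - m_i)
    return centroids, counts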

This exact implementation is not available in scikit-learn. The closest alternative is probably the MiniBatchKMeans estimator with its partial_fit method.
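
For reference, a rough sketch of that approach, assuming the new data arrives in small batches (stream_of_batches is an illustrative name, and the first batch is assumed to contain at least n_clusters samples so the centers can be initialized):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, random_state=200)
for batch in stream_of_batches:      # each batch: array of shape (n_samples, n_features)
    mbk.partial_fit(batch)           # updates mbk.cluster_centers_ incrementally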

Upvotes: 1
