D.L.

Reputation: 11

Storing K-means clustering results for each Iteration using scikit-learn

I want to illustrate the iterations of the k-means algorithm, and since I'm using Python I came across the sklearn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. After some research, these are the difficulties I have:

  1. Finding a "good" dataset. My goal is to illustrate how the algorithm evolves, so I want the clusters to change significantly between iterations. For now I'm using the iris dataset, but the clusters do not change much during the first iterations. I could probably come up with a fictional dataset myself, but I'd rather use a real one.

  2. The sklearn implementation lets me specify the maximum number of iterations, but not an exact number of iterations. Ideally I want to run the k-means algorithm for a fixed number of iterations and store the result of each iteration for plotting purposes.

Any response addressing these issues would be greatly appreciated :)

I apologize in advance for my poor English or any lack of clarity, and also if a similar question has already been answered; I wasn't able to find one.

Upvotes: 1

Views: 2774

Answers (1)

Lev Pleshkov

Reputation: 415

This page describes how to animate k-means with Matplotlib and sklearn: https://medium.com/@phil.busko/animation-of-k-means-clustering-31a484c30ba5.

I hope that my humble starter code will be helpful!

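from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # keep only the first two features (sepal length and width)

iterations = 10
centroids = None  # no centroids yet, so the first fit falls back to 'k-means++'

for i in range(iterations):
    # max_iter=1 limits each fit to a single k-means iteration; passing the
    # previous centroids as init makes the next fit continue from them.
    kmeans = KMeans(
        max_iter=1,
        n_init=1,
        init=(centroids if centroids is not None else 'k-means++'),
        n_clusters=2,
        random_state=1)
    kmeans.fit(X)
    centroids = kmeans.cluster_centers_
    print(f'iter: {i} - first: {centroids[0]}, second: {centroids[1]}')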
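
To keep every intermediate result for plotting, the same loop can append the centroids and labels to a list after each fit. This is only a minimal sketch: the function name run_fixed_iterations and the history list are my own naming, not part of sklearn.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]  # sepal length and sepal width


def run_fixed_iterations(X, n_clusters=2, iterations=10, seed=1):
    """Refit KMeans one iteration at a time and record each intermediate state.

    Hypothetical helper for illustration; not an sklearn function.
    """
    history = []      # one (centroids, labels) tuple per iteration
    centroids = None  # first fit uses k-means++ initialisation
    for _ in range(iterations):
        kmeans = KMeans(
            n_clusters=n_clusters,
            init=(centroids if centroids is not None else "k-means++"),
            n_init=1,
            max_iter=1,
            random_state=seed,
        )
        kmeans.fit(X)
        centroids = kmeans.cluster_centers_
        # Copy the arrays so later fits cannot affect stored results.
        history.append((centroids.copy(), kmeans.labels_.copy()))
    return history


history = run_fixed_iterations(X)
last_centroids, last_labels = history[-1]
print(len(history), last_centroids)
```

Each entry of history can then be drawn as one frame, for example a scatter plot of X coloured by the stored labels with the stored centroids on top.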

Some thoughts on the 1st question

How the cluster centroids evolve depends not only on the dataset, but also on the initial centroid locations and on the exact algorithm used to perform the clustering (sklearn exposes this choice through the algorithm parameter, whose accepted values vary between versions: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). The most dramatic changes in centroid positions usually happen in the first few iterations, and later changes may be too small to see on a plot. In my opinion, the iris dataset is a nice educational example, and the two things to experiment with are:

  • choosing a different pair of features (e.g. petal length vs sepal width would probably produce a different result compared to sepal length vs sepal width),
  • initialising the centroids from randomly chosen data points (try 'random' instead of 'k-means++' as the value of the init argument) or with exact values of your choice.
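
One rough way to compare these options is to fit each combination once and read the n_iter_ attribute, which reports how many iterations the fit ran. A setup that needs more iterations leaves more frames to show. The sketch below assumes 2 clusters and a single seed of 0; both are arbitrary choices of mine, and the helper name iterations_to_converge is hypothetical.

```python
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
names = iris.feature_names


def iterations_to_converge(X2d, init, seed=0, n_clusters=2):
    """Fit once and return the number of iterations the fit ran."""
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1, random_state=seed)
    km.fit(X2d)
    return km.n_iter_


results = {}
# Try every pair of features with both initialisation strategies.
for a, b in combinations(range(X.shape[1]), 2):
    for init in ("k-means++", "random"):
        key = (names[a], names[b], init)
        results[key] = iterations_to_converge(X[:, [a, b]], init)

# List the slowest-converging setups first.
for (fa, fb, init), n in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{n:2d} iterations: {fa} vs {fb}, init={init}")
```

The counts depend on the seed, so it is worth repeating this with a few different seed values before settling on a setup.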

Some thoughts on the 2nd question

I think sklearn has no option to run k-means for an exact number of iterations because each run proceeds until it converges within the given tolerance (tol, 1e-4 by default) or reaches max_iter. Forcing a fixed number of iterations goes somewhat against the definition of the algorithm itself, although it is very useful for demonstration purposes!
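
If an exact iteration count matters more than using sklearn's estimator, the basic assign-then-update loop is short enough to write by hand. The following is a sketch of plain Lloyd-style iterations in NumPy, not sklearn's implementation; the function name kmeans_history is hypothetical, and the starting centroids (the first two samples) are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris


def kmeans_history(X, init_centroids, n_iterations):
    """Run exactly n_iterations assign/update steps and record each one.

    Returns a list of (centroids, labels) tuples, where labels are the
    nearest-centroid assignments for the centroids stored alongside them.
    """
    centroids = np.asarray(init_centroids, dtype=float).copy()
    history = []
    for _ in range(n_iterations):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        history.append((centroids.copy(), labels))
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[k] = members.mean(axis=0)
    return history


X = load_iris().data[:, :2]
start = X[:2]  # arbitrary starting centroids, chosen only for the demo
history = kmeans_history(X, start, n_iterations=10)
print(len(history), history[-1][0])
```

Because nothing stops the loop early, it always produces the requested number of entries, even after the centroids have stopped moving.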

Upvotes: 1
