Reputation: 11
I want to illustrate the iterations of the k-means algorithm, and I've come across the sklearn implementation (since I'm using Python): https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. After some research, here are the difficulties I have:
Finding a "good" dataset. My goal is to illustrate the evolution of the algorithm, so I want the clusters to change significantly between iterations. For now I'm using the iris dataset, but the clusters do not evolve much during the first iterations. I'm guessing I could come up with a fictional dataset myself, but I'd rather use a real one.
The sklearn implementation allows me to specify the maximum number of iterations but does not allow me to specify an exact number of iterations. Ideally, I want to run the k-means algorithm for a fixed number of iterations and store the results of each iteration for plotting purposes.
Any response addressing these issues would be greatly appreciated :)
I apologize in advance for my poor English or any lack of clarity, and if a similar question has already been answered; I wasn't able to find one.
Upvotes: 1
Views: 2774
Reputation: 415
There is a page describing how to build such an animation with Matplotlib and sklearn's k-means: https://medium.com/@phil.busko/animation-of-k-means-clustering-31a484c30ba5.
I hope that my humble starter code will be helpful!
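from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, :2]  # keep the first two features so the result can be plotted in 2D

iterations = 10
centroids = None  # no centroids yet: the first pass falls back to 'k-means++'

for i in range(iterations):
    # max_iter=1 performs a single k-means step; passing the previous
    # centroids as init makes every pass continue from the last one
    kmeans = KMeans(
        max_iter=1,
        n_init=1,
        init=(centroids if centroids is not None else 'k-means++'),
        n_clusters=2,
        random_state=1)
    kmeans.fit(X)
    centroids = kmeans.cluster_centers_
    print(f'iter: {i} - first: {centroids[0]}, second: {centroids[1]}')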
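To keep every intermediate state for plotting rather than just printing it, the same warm-start trick can be wrapped in a small helper. This is only a sketch: kmeans_history is a made-up name, not part of sklearn, and starting from randomly picked data points (instead of 'k-means++') is my own assumption, chosen so that the first steps tend to move more.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans


def kmeans_history(X, n_clusters, n_steps, seed=0):
    """Run n_steps single-iteration fits, each warm-started from the previous one.

    Hypothetical helper (not an sklearn function). Returns a list of
    (centroids, labels) snapshots, one per step.
    """
    rng = np.random.default_rng(seed)
    # Assumption: use randomly chosen data points as the starting centroids
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    history = []
    for _ in range(n_steps):
        km = KMeans(n_clusters=n_clusters, init=centroids, n_init=1, max_iter=1)
        km.fit(X)
        centroids = km.cluster_centers_
        # Copy so that later fits cannot alias the stored snapshots
        history.append((centroids.copy(), km.labels_.copy()))
    return history


X = datasets.load_iris().data[:, :2]
history = kmeans_history(X, n_clusters=3, n_steps=10)
for step, (centers, labels) in enumerate(history):
    # Cluster sizes show how many points changed side between steps
    sizes = np.bincount(labels, minlength=3).tolist()
    print(step, np.round(centers, 3).tolist(), sizes)
```

Each snapshot can then be drawn as one frame, for example by colouring X with the stored labels and overlaying the stored centroids.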
How the cluster centroids evolve depends not only on the dataset but also on the initial centroid locations and on the exact algorithm used to perform the clustering (sklearn exposes the choice through the algorithm parameter; the available options depend on the version, see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). The most dramatic changes in centroid positions happen during the first few iterations; later changes may not be visible on a plot. In my opinion, the iris dataset is a nice educational example, and the two things to experiment with are 'random' instead of 'k-means++' as the value of the init argument, or exact initial centroid values of your choice.
I think the reason sklearn has no option to run k-means for a particular number of iterations is that each run proceeds until it converges within the given tolerance (tol, 1e-4 by default) or reaches max_iter. Forcing a fixed number of iterations kind of goes against the definition of the algorithm itself, although it is very useful for demonstration purposes!
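To get a feeling for how much the starting point matters, one could compare a few init choices and look at how many iterations each run needs before it stops. This is a rough sketch, not a benchmark: the manual_init values are hypothetical and deliberately bunched together at one edge of the data so the centroids have some distance to travel.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data[:, :2]

# Hypothetical hand-picked starting centroids, bunched together on purpose
manual_init = np.array([[4.5, 3.0], [5.0, 3.0], [5.5, 3.0]])

runs = {
    "k-means++": KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=1),
    "random": KMeans(n_clusters=3, init="random", n_init=1, random_state=1),
    "manual": KMeans(n_clusters=3, init=manual_init, n_init=1),
}

for name, km in runs.items():
    km.fit(X)
    # n_iter_ is the number of iterations actually run before stopping
    print(f"{name:10s} iterations: {km.n_iter_:2d}  inertia: {km.inertia_:.3f}")
```

A start that needs more iterations usually gives more frames in which something visibly moves, which is what you want for an illustration.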
Upvotes: 1