Akira

Reputation: 2870

Why does kmeans give exactly the same results every time?

I have re-run kmeans 4 times and get:

(figure: a 2×2 grid of scatter plots, each showing essentially the same four clusters with red centroid markers)

From other answers, I got that

Every time K-Means initializes the centroids, they are generated randomly.

Could you please explain why the results are exactly the same each time?

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')

fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))

for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters = 4)
        kmeans.fit(don)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);

plt.show()

Upvotes: 2

Views: 3575

Answers (3)

Kundan

Reputation: 11

Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.

The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). random_state's value may be None (the default), an integer, or a numpy.random.RandomState instance.

for reference : https://scikit-learn.org/stable/glossary.html#term-random_state
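As a minimal sketch of this (using synthetic two-blob data rather than the `don` dataset from the question), fixing `random_state` pins the centroid initialization, so repeated fits return identical results:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the question's data: two well-separated blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 5.0)])

# Same random_state => same centroid initialization => identical centroids
a = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X).cluster_centers_
b = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X).cluster_centers_

# Sort rows so cluster label order doesn't matter in the comparison
print(np.allclose(np.sort(a, axis=0), np.sort(b, axis=0)))  # True
```

Leaving `random_state=None` (the default) draws a fresh initialization each fit, which is the behaviour the question is asking about.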

Upvotes: 1

Marcin

Reputation: 1391

They are not the same; they are similar. K-means iteratively moves the centroids so that they split the data better and better. That process is deterministic, but the initial centroid positions have to be chosen somehow, and this is usually done at random. A random start does not mean the final centroids will be random: they converge to something relatively good, and often similar across runs.

Have a look at your code with this simple modification:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')

fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))

cc = []

for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters = 4)
        kmeans.fit(don)
        cc.append(kmeans.cluster_centers_)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);

plt.show()

cc

If you look at the exact values of those centroids, they look like this:

[array([[ 4.97975722,  4.93316461],
        [ 5.21715504, -0.18757547],
        [ 0.31141141,  0.06726803],
        [ 0.00747797,  5.00534801]]),
 array([[ 5.21374245, -0.18608103],
        [ 0.00747797,  5.00534801],
        [ 0.30592308,  0.06549162],
        [ 4.97975722,  4.93316461]]),
 array([[ 0.30066361,  0.06804847],
        [ 4.97975722,  4.93316461],
        [ 5.21017831, -0.18735444],
        [ 0.00747797,  5.00534801]]),
 array([[ 5.21374245, -0.18608103],
        [ 4.97975722,  4.93316461],
        [ 0.00747797,  5.00534801],
        [ 0.30592308,  0.06549162]])]

Similar, but different sets of values.

Also:

Have a look at default arguments to KMeans. There is one called n_init:

Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

By default it is equal to 10, which means that every time you run k-means it actually runs 10 times and picks the best result. Those best results will be even more similar than the results of a single run of k-means.
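A rough way to see this effect (with synthetic blob data, not the question's dataset): with `n_init=1` each fit keeps whatever its single random start converged to, while `n_init=10` keeps the best of ten starts, so its inertia can never be worse:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Four well-separated blobs as a stand-in for the question's data
X = np.vstack([rng.normal(c, 0.2, size=(100, 2))
               for c in [(0, 0), (0, 5), (5, 0), (5, 5)]])

# n_init=1: one random start per fit, so an unlucky start is kept
single = [KMeans(n_clusters=4, n_init=1, random_state=s).fit(X).inertia_
          for s in range(5)]

# n_init=10: each fit keeps the best of ten starts, in terms of inertia
best = [KMeans(n_clusters=4, n_init=10, random_state=s).fit(X).inertia_
        for s in range(5)]

print(min(best), min(single))  # best-of-10 inertia is never worse
```

On well-separated data like this, both settings usually land in the same optimum; the gap between them grows on harder data, which is exactly why the default takes the best of several runs.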

Upvotes: 4

Akira

Reputation: 2870

I'm posting @AEF's comment here to remove this question from the unanswered list.

Random initialization does not necessarily mean random results. Easiest example: k-means with k = 1 always finds the mean in one step, regardless of where the center is initialized.
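The k = 1 case is easy to check directly (a sketch with synthetic data): whatever the random start, the single centroid ends up at the overall data mean:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.uniform(-10, 10, size=(200, 2))

# With k=1, every point belongs to the one cluster, so the update step
# moves the centroid straight to the overall mean, whatever the init was
for seed in range(3):
    km = KMeans(n_clusters=1, n_init=1, random_state=seed).fit(X)
    print(np.allclose(km.cluster_centers_[0], X.mean(axis=0)))  # True
```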

Upvotes: 1
