Brudalaxe

Reputation: 191

Why does my k-means convergence condition give different results than sklearn?

I've written a function that performs k-means clustering with data, K, and n (the number of iterations) as inputs. I can set n=100 so that the function calculates Euclidean distances, assigns clusters, and computes new cluster centroids 100 times over before finishing. When I compare the results with sklearn's KMeans (using the same data and K), I get the same centroids.

I would like my function to run only until convergence (to avoid unnecessary iterations), so that it no longer needs the number of iterations as an input. To implement this I've used a while loop, whereas my original function used a for loop that iterates n times. The general steps of my code are as follows:

  1. Randomly create the centroids and store them in the variable centroids.
  2. Calculate the Euclidean distance of each data point from the centroids generated above and store the results in the variable EucDist.
  3. Assign each data point to a cluster based on its EucDist from the centroids.
  4. Regroup the data points by the cluster index cluster, then compute the mean of each group and assign these means as the new centroids (sketched in vectorized form below).
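
For reference, steps 2-4 can also be written in a more compact vectorized form. The snippet below is only a sketch on toy data, using the same shapes as in my code (data has shape (a, 2) and centroids has shape (2, K)):

import numpy as np

# Toy data purely for illustration: a points in 2D, K clusters.
rng = np.random.default_rng(0)
a, K = 100, 3
data = rng.normal(size=(a, 2))                          # shape (a, 2)
centroids = data[rng.choice(a, K, replace=False)].T     # shape (2, K)

# Step 2: Euclidean distance of every point to every centroid -> shape (a, K)
EucDist = np.linalg.norm(data[:, :, None] - centroids[None, :, :], axis=1)

# Step 3: assign each point to its nearest centroid
cluster = np.argmin(EucDist, axis=1)

# Step 4: new centroid = mean of the points assigned to it
for k in range(K):
    centroids[:, k] = data[cluster == k].mean(axis=0)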

Steps 2-4 are then looped over until convergence. Below is the code for my while loop (I'm aware there is probably a bit of unnecessary code in here; however, I'm unsure where the problem originates, if it even is a problem).

while True:

    previouscentroids = centroids
    EucDist = np.array([]).reshape(a, 0)

    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
        EucDist = np.c_[EucDist, Dist]
    
    cluster = np.argmin(EucDist, axis = 1) + 1
    
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
     
    for k in range(K):
        points[k + 1] = points[k + 1].T
        
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis = 0)
        
    if np.all(previouscentroids - centroids == 0):
        break

My understanding of this is that once centroids is the same as it was at the start of the iteration (previouscentroids), the loop will break and I'll have my final clusters. I assumed these would be the same centroids as produced by my original function (and the same as sklearn's), since I thought that once convergence is reached, no matter how many more times you iterate, the clusters remain unchanged (and therefore so do the centroids). Below is the for loop from my original function for comparison (very little has been changed).

for i in range(n):
    
    EucDist = np.array([]).reshape(a, 0)

    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
        EucDist = np.c_[EucDist, Dist]
        
    cluster = np.argmin(EucDist, axis = 1) + 1
    
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
     
    for k in range(K):
        points[k + 1] = points[k + 1].T
    
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis = 0)
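
The kind of stopping test I'm trying to express is essentially the pattern below. This is only a sketch, not my actual function; has_converged and update are placeholder names, and update stands in for steps 2-4 above:

import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-6):
    """Stop once no centroid has moved by more than tol."""
    return np.allclose(old_centroids, new_centroids, atol=tol)

# Intended usage (illustrative only):
# while True:
#     old = centroids.copy()                 # keep the previous centroids
#     centroids = update(data, centroids)    # steps 2-4
#     if has_converged(old, centroids):
#         break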

Upvotes: 0

Views: 1174

Answers (2)

Ditlev Jørgensen

Reputation: 43

K-means will change from one run to the next because the initial cluster centers are chosen at random. So unless you start from exactly the same initial cluster centers, the results won't be the same.
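
For example, you can make the comparison apples-to-apples by passing the same initial centers to scikit-learn via the init parameter. This is a rough sketch with placeholder data; note that KMeans expects centers with shape (n_clusters, n_features), i.e. the transpose of the (2, K) layout used in the question:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))                             # placeholder data
K = 3
init_centers = X[rng.choice(len(X), K, replace=False)]    # shape (K, 2)

# n_init=1 because the initial centers are supplied explicitly
km = KMeans(n_clusters=K, init=init_centers, n_init=1).fit(X)
print(km.cluster_centers_)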

Upvotes: 1

Pierre-Loic

Reputation: 1554

K-means is a non-deterministic algorithm. If you don't use the same initialization as scikit-learn, you are not guaranteed to find the same clusters. The scikit-learn documentation says:

Given enough time, K-means will always converge, however this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids.
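
In practice that means letting KMeans try several random initializations and keep the best run (the one with the lowest inertia), and fixing random_state if you want reproducible results. A small sketch with placeholder data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))    # placeholder data

# 10 different random initializations; the run with the lowest inertia
# (sum of squared distances to the closest centroid) is kept.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)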

Upvotes: 2
