Reputation: 191
I've written a function that executes k-means clustering with data
, K
, and n
(number of iterations) as inputs. I can set n=100
so that the function calculates Euclidean distance, assigns clusters and calculates new cluster centroids 100 times over before completing. When comparing the results of this with sklearn's KMeans (using the same data
and K
) I get the same centroids.
I would like my function to run only until convergence (so as to avoid unnecessary iterations), so it no longer needs number of iterations as an input. To implement this I've used a while loop, whereas in my original function I used a for loop that iterates n
times. The general steps of my code are as follows:
centroids
.EucDist
.cluster
based on the EucDist
from centroids
.cluster
. Then compute the mean of separated clusters and assign it as new centroids
.Steps 2-4 are then looped over until convergence. Below is the code for my while loop (I'm aware there is probably a bit of unnecessary code in here, however I'm unsure where the problem is originating from, if it is even a problem).
while True:
previouscentroids = centroids
EucDist = np.array([]).reshape(a, 0)
for k in range(K):
Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
EucDist = np.c_[EucDist, Dist]
cluster = np.argmin(EucDist, axis = 1) + 1
points = {}
for k in range(K):
points[k + 1] = np.array([]).reshape(2, 0)
for i in range(a):
points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
for k in range(K):
points[k + 1] = points[k + 1].T
for k in range(K):
centroids[:, k] = np.mean(points[k + 1], axis = 0)
if np.all(previouscentroids - centroids == 0):
break
My understanding of this is that once centroids
is the same as it was at the start of the iteration (previouscentroids
), the loop will break and I'll have my final clusters. I assumed this would be the same as the centroids produced from my original function (and the same as from sklearn), as I thought once convergence is reached, no matter how many times you iterate the loop, the clusters will remain unchanged (thus so will the centroids). Below is the for loop from my original function for comparison (very little has been changed).
for i in range(n):
EucDist = np.array([]).reshape(a, 0)
for k in range(K):
Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
EucDist = np.c_[EucDist, Dist]
cluster = np.argmin(EucDist, axis = 1) + 1
points = {}
for k in range(K):
points[k + 1] = np.array([]).reshape(2, 0)
for i in range(a):
points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
for k in range(K):
points[k + 1] = points[k + 1].T
for k in range(K):
centroids[:, k] = np.mean(points[k + 1], axis = 0)
Upvotes: 0
Views: 1174
Reputation: 43
Kmeans will change from each time you run it because the initial clustercenters will be chosen at random. So unless you have excatly same initial clustercenters the results wont be the same.
Upvotes: 1
Reputation: 1554
K-means is a non deterministic algorithm. If you don't have the same initialization than in scikit-learn, you are not sure to find the same clusters. In scikit-learn documentation, it is written :
Given enough time, K-means will always converge, however this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids.
Upvotes: 2