Does the total distance sum in K-means have to be always decreasing?

Question

I'm implementing k-means clustering in Java. I don't see a problem in my code and it appears to work well. However, there is something I don't understand.

Step 1: Choose N centers. (Let there be N clusters.)

Step 2: Put each vector into the cluster with the nearest center, using Euclidean distance (||v1 - v2||).

Step 3: Find the new mean (= center) for each cluster.

Step 4: If the centers have moved significantly, go to step 2.
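
For reference, the four steps above can be sketched in Java roughly as follows. This is only a minimal sketch, not the asker's actual code: the class name `KMeansSketch`, the method names, the `1e-12` movement threshold, and the choice to leave an empty cluster's center where it is are all assumptions made for illustration. It compares squared distances (which gives the same nearest center as plain Euclidean distance) and records the sum of squared distances after each assignment step.

```java
import java.util.Arrays;

class KMeansSketch {

    // Squared Euclidean distance; no square root is taken.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Runs the iterations starting from the given centers (Step 1 is done by the caller)
    // and returns the sum of squared distances measured after each assignment step.
    // Note: the centers array is updated in place.
    static double[] run(double[][] points, double[][] centers, int maxIterations) {
        int k = centers.length;
        int dim = points[0].length;
        int[] assignment = new int[points.length];
        double[] history = new double[maxIterations];
        int iterations = 0;

        for (int it = 0; it < maxIterations; it++) {
            // Step 2: assign each point to its nearest center.
            double ssq = 0.0;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[i], centers[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                assignment[i] = best;
                ssq += bestDist;
            }
            history[iterations++] = ssq;

            // Step 3: recompute each center as the mean of its cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // assumption: an empty cluster keeps its old center
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (Math.abs(mean - centers[c][d]) > 1e-12) {
                        moved = true;
                    }
                    centers[c][d] = mean;
                }
            }

            // Step 4: stop once no center has moved noticeably.
            if (!moved) {
                break;
            }
        }
        return Arrays.copyOf(history, iterations);
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}};
        double[][] centers = {{0, 0}, {1, 0}}; // deliberately poor starting centers
        System.out.println(Arrays.toString(run(points, centers, 20)));
    }
}
```

Printing the returned history makes it easy to check whether the recorded quantity ever goes up between iterations.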

However, when I plot the total of the point-to-respective-center distances after each iteration, I can see that the total is not decreasing all the time (although it is decreasing in general and converging well).

The total distance at the 2nd iteration is always shorter than at the first, and it is the shortest overall. The total distance then increases slightly at the 3rd iteration and converges at the 4th or 5th iteration.

I believe I was told that it should always be decreasing. What's wrong: my algorithm (implementation) or my assumption about the total distance?

Has QUIT--Anony-Mousse · Accepted Answer

For a given seed, it must decrease (or at least never increase) from one iteration to the next.

Maybe your error is that you use Euclidean distance.

K-means does not minimize Euclidean distances.

This is a common misconception that even half of the professors get wrong. K-means minimizes the sum-of-squares, i.e., the sum of squared Euclidean distances. And no, this does not find the solution with the smallest Euclidean distances.
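
Written out, the quantity in question looks like this. The notation is an assumption made here for illustration, not part of the original answer: $C_j$ is the set of points currently assigned to cluster $j$, $\mu_j$ is that cluster's mean, and $N$ is the number of clusters from the question.

$$
\mathrm{SSQ} = \sum_{j=1}^{N} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

Reassigning a point to its nearest center cannot increase this sum, and neither can replacing a center by the mean of its cluster, since the mean is the point that minimizes the sum of squared distances to the cluster's members. The same guarantee does not hold for the unsquared sum $\sum_{j}\sum_{x \in C_j} \lVert x - \mu_j \rVert$, which the mean does not minimize.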

So make sure you are plotting SSQ (the sum of squared distances) everywhere. Remove all square roots from your code. They do not belong in k-means.
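
A small one-dimensional example may help show why the unsquared total can go up when a center is moved to the mean. This is a sketch with made-up data; the class name `MeanVsDistanceSum` and its helper methods are hypothetical. For the cluster {0, 0, 10}, moving the center from 0 to the mean lowers the sum of squares but raises the sum of plain distances.

```java
class MeanVsDistanceSum {

    // Sum of plain (unsquared) distances from each value to the center c.
    static double sumOfDistances(double[] xs, double c) {
        double total = 0.0;
        for (double x : xs) {
            total += Math.abs(x - c);
        }
        return total;
    }

    // Sum of squared distances from each value to the center c.
    static double sumOfSquares(double[] xs, double c) {
        double total = 0.0;
        for (double x : xs) {
            total += (x - c) * (x - c);
        }
        return total;
    }

    static double mean(double[] xs) {
        double total = 0.0;
        for (double x : xs) {
            total += x;
        }
        return total / xs.length;
    }

    public static void main(String[] args) {
        double[] cluster = {0.0, 0.0, 10.0}; // made-up data for illustration
        double oldCenter = 0.0;
        double newCenter = mean(cluster);

        System.out.println("sum of squares at old center:   " + sumOfSquares(cluster, oldCenter));
        System.out.println("sum of squares at the mean:     " + sumOfSquares(cluster, newCenter));
        System.out.println("sum of distances at old center: " + sumOfDistances(cluster, oldCenter));
        System.out.println("sum of distances at the mean:   " + sumOfDistances(cluster, newCenter));
    }
}
```

So a plot of summed Euclidean distances can show a small bump even when the implementation is correct, while a plot of the squared version should not.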
