J Gee

Reputation: 127

Does the total distance sum in K-means have to be always decreasing?

I'm working on k-means clustering in Java. I don't see a problem in my code and it looks fine. However, there is something I don't understand.

Step 1: Choose N centers (so there will be N clusters).

Step 2: Put each vector into the cluster with the nearest center, using Euclidean distance (||v1 - v2||).

Step 3: Find the new mean (= center) of each cluster.

Step 4: If the centers have moved significantly, go to step 2.
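
In code, the loop looks roughly like this minimal Java sketch (the data, names, and stopping test are illustrative, not my actual code):

    public class KMeansSketch {
        public static void main(String[] args) {
            double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}, {8.5, 7}};
            int k = 2;

            // Step 1: choose k initial centers (here simply two of the points).
            double[][] centers = { points[0].clone(), points[2].clone() };
            int[] assign = new int[points.length];

            for (boolean moved = true; moved; ) {
                // Step 2: put each vector into the cluster with the nearest center.
                for (int i = 0; i < points.length; i++) {
                    double best = Double.MAX_VALUE;
                    for (int j = 0; j < k; j++) {
                        double dx = points[i][0] - centers[j][0];
                        double dy = points[i][1] - centers[j][1];
                        double d2 = dx * dx + dy * dy; // comparing squared distances suffices
                        if (d2 < best) { best = d2; assign[i] = j; }
                    }
                }
                // Step 3: find the new mean (= center) of each cluster.
                double[][] sum = new double[k][2];
                int[] count = new int[k];
                for (int i = 0; i < points.length; i++) {
                    sum[assign[i]][0] += points[i][0];
                    sum[assign[i]][1] += points[i][1];
                    count[assign[i]]++;
                }
                // Step 4: repeat until no center moves.
                moved = false;
                for (int j = 0; j < k; j++) {
                    if (count[j] == 0) continue; // leave an empty cluster's center alone
                    double nx = sum[j][0] / count[j];
                    double ny = sum[j][1] / count[j];
                    if (nx != centers[j][0] || ny != centers[j][1]) moved = true;
                    centers[j][0] = nx;
                    centers[j][1] = ny;
                }
            }
            System.out.println(java.util.Arrays.deepToString(centers));
        }
    }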

However, when I plot the total of the point-to-respective-center distances after each iteration, I can see that the total is not decreasing all the time (although it is decreasing in general and converging well).

[plot: total point-to-center distance per iteration]

The total distance at the 2nd iteration is always shorter than at the 1st, and is in fact the shortest. The total then increases slightly at the 3rd iteration and converges at the 4th or 5th iteration.

I believe I was told it should always be decreasing. What's wrong: my algorithm (implementation), or my assumption about the total distance?

Upvotes: 2

Views: 2333

Answers (2)

Hello Worlds

Reputation: 41

Additional comments:

  1. Don't minimize variances (or equivalently, standard deviations), as tempting as it might be:

Minimizing sum of squared distances is not equivalent to minimizing variances, but that hasn't stopped people from suggesting it as the proper objective for k-means.

It is easy to imagine why this could be a bad idea:

Imagine a single point that is almost mid-way (in Euclidean distance) between two cluster centroids, both clusters having the same variance before the new point is included, but one having a much larger membership than the other. Let's say the new point is slightly closer to the centroid of the smaller cluster, so nearest-centroid assignment puts it there. A variance-based rule can disagree: because the larger cluster divides its sum of squares by a much larger membership, adding the point to it barely changes its variance, while adding the point to the smaller cluster changes that cluster's variance a great deal. Minimizing variance growth would therefore send the point to the larger but farther cluster.
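
To make this concrete, here is a hypothetical 1-D sketch (all numbers invented for illustration) comparing the two rules:

    public class VarianceVsDistance {
        // Variance of a cluster with n members and sum of squared deviations ssq,
        // after adding one point at distance d from the current mean:
        // newSSQ = ssq + d*d * n / (n + 1); newVar = newSSQ / (n + 1).
        static double varianceAfterAdd(double ssq, int n, double d) {
            double newSsq = ssq + d * d * n / (n + 1);
            return newSsq / (n + 1);
        }

        public static void main(String[] args) {
            int nA = 100, nB = 5;             // large and small memberships
            double ssqA = 100.0, ssqB = 5.0;  // both clusters start with variance 1.0
            double dA = 2.1, dB = 2.0;        // the point is slightly closer to B

            double growthA = varianceAfterAdd(ssqA, nA, dA) - ssqA / nA;
            double growthB = varianceAfterAdd(ssqB, nB, dB) - ssqB / nB;

            System.out.println("nearest centroid: " + (dA < dB ? "A" : "B"));
            System.out.printf("smallest variance growth: %s (A: +%.4f, B: +%.4f)%n",
                    growthA < growthB ? "A" : "B", growthA, growthB);
            // Prints B for the distance rule but A for the variance rule:
            // the two objectives genuinely disagree.
        }
    }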

  2. If you are minimizing the proper objective function but it still isn't decreasing monotonically, check that you aren't quantizing your centroid means:

This would happen, for example, if you are performing image segmentation with integer values that range in [0, 255] rather than float values in [0, 1], and you are forcing the centroid means to be uint8 datatypes.

Whenever the centroid means are found, they should then be used in the objective function as-is. If your algorithm computes one value for the centroid means (floats) but then evaluates the objective with different values (byte ints), the plotted objective can deviate from the monotone decrease you expect.
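
As a hypothetical illustration (invented pixel values, not anyone's real code), scoring the same cluster against its true float mean and against that mean rounded to a byte shows the gap that quantization introduces:

    public class QuantizedCentroid {
        // Sum of squared distances from each member to a given center.
        static double ssq(int[] values, double center) {
            double s = 0;
            for (int v : values) {
                double d = v - center;
                s += d * d;
            }
            return s;
        }

        public static void main(String[] args) {
            int[] cluster = {10, 11, 11, 14};       // invented pixel intensities in [0, 255]
            double floatMean = 0;
            for (int v : cluster) floatMean += v;
            floatMean /= cluster.length;            // 11.5
            long byteMean = Math.round(floatMean);  // 12 once forced into a uint8-style value

            System.out.printf("SSQ with float mean %.2f: %.2f%n", floatMean, ssq(cluster, floatMean));
            System.out.printf("SSQ with rounded mean %d: %.2f%n", byteMean, ssq(cluster, byteMean));
            // Prints 9.00 vs 10.00: the rounded centroid is never better, and
            // mixing the two inside one run can make the plotted objective rise.
        }
    }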

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77485

It must always be decreasing for the same seed.

Maybe your error is that you use Euclidean distance.

K-means does not minimize Euclidean distances.

This is a common misconception that even half of the professors get wrong. K-means minimizes the sum-of-squares, i.e., the sum of squared Euclidean distances. And no, this does not find the solution with the smallest Euclidean distances.

So make sure you are plotting the SSQ everywhere. Remove all square roots from your code; they do not belong in k-means.
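
In code, the quantity to plot looks roughly like this (the array layout and names are assumptions about your code, not your actual code):

    // The quantity to plot: the sum of *squared* Euclidean distances of every
    // point to its assigned center -- no Math.sqrt anywhere.
    static double ssqObjective(double[][] points, double[][] centers, int[] assignment) {
        double ssq = 0.0;
        for (int i = 0; i < points.length; i++) {
            double[] p = points[i];
            double[] c = centers[assignment[i]];
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - c[d];
                ssq += diff * diff; // squared distance: do NOT take a square root
            }
        }
        return ssq;
    }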

Upvotes: 5
