k-means - is it possible to get worse result with higher k?

Question

I want to find optimal k for using k-means on a data set. I use code below:

Sum_of_squared_distances = []
for k in range(1,15):
    print(k)
    Sum_of_squared_distances.append(KMeans(n_clusters=k).fit(x).inertia_)
plt.plot(range(1,15), Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.xticks(range(1,15))
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.savefig('optimal-k.jpg')
plt.show()

my result looks like this:

as you can see for some values of k, higher k has worse result. I want to know is this possible or I am doing something wrong? some in depth explanation would be appreciated.

Cihan · Accepted Answer

It is possible yes.

K-means is not exactly solved for a given k -- it's an NP-Hard problem. What you see are the locally optimum results that are achieved through a heuristic-based optimization algorithm.

To be precise, if you were to be able to exactly solve it for a given k, you would see a graph strictly going down. But because you can't, you see a pattern as in your plot.

The algorithm's result depends on the initial random start point, so if you ran each k many times (i.e. increase the n_init parameter in KMeans), you'd see a graph more smoothly decreasing, because then you'd remove some of the noise.

k-means - is it possible to get worse result with higher k?

Answers (1)

Related Questions