Reputation: 2703
I'm experiencing a strange phenomenon. I have created an artificial dataset of only 2 columns filled with numbers:
If I run the k-means algorithm on it, I get the following partition:
This looks fine. Now, I scale the columns with StandardScaler and I obtain the following dataset:
But if I run the k-means algorithm on it, I get the following partition:
Now it looks bad. How come? Scaling numerical features before clustering with them is the usual recommendation for k-means, so I'm quite surprised by this result.
Here is the code to show the partition:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = pd.read_csv("dataset_scaled.csv", sep=",")
k_means = KMeans(n_clusters=3)
k_means.fit(data)
partition = k_means.labels_ + 1

# Plot each cluster in its own color, reusing the same axes
colors = ["red", "green", "blue"]
ax = None
for i in range(1, 4):
    ax = data.iloc[partition == i].plot.scatter(x='a', y='b', color=colors[i - 1], legend=False, ax=ax)
plt.show()
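For completeness, here is a minimal sketch of the scaling step itself (the unscaled file name "dataset.csv" is an assumption on my part; the question only names the scaled file):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# "dataset.csv" is an assumed file name for the unscaled data
data = pd.read_csv("dataset.csv", sep=",")

# StandardScaler rescales each column to zero mean and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)
scaled.to_csv("dataset_scaled.csv", index=False)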
Upvotes: 4
Views: 1482
Reputation: 77454
Because your across-cluster variance is all in x, while your within-cluster variance is mostly in y, standardization reduces the quality of the clustering: rescaling every column to unit variance shrinks the informative x axis and gives the noisy y axis equal weight. So don't assume a "best practice" will always be best.
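Here is a minimal sketch that reproduces this effect on synthetic data (the cluster centers, spreads, and sample sizes are my own choices, not the asker's dataset): three clusters separated along x, with most of the within-cluster spread in y. With these numbers, k-means should recover the true clusters on the raw data but typically not on the scaled data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Three clusters of 100 points: tight in x (std 0.5) around centers 0, 10, 20,
# but with large shared noise in y (std 5)
n = 100
labels = np.repeat([0, 1, 2], n)
x = np.concatenate([rng.normal(c, 0.5, n) for c in (0, 10, 20)])
y = rng.normal(0, 5, 3 * n)
X = np.column_stack([x, y])

# Compare clustering quality (adjusted Rand index, 1.0 = perfect) before and after scaling
for name, features in [("raw", X), ("scaled", StandardScaler().fit_transform(X))]:
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
    print(name, adjusted_rand_score(labels, pred))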
This is a toy example, and real data will not look like this. On most real data, standardization will give more meaningful results. Nevertheless, this demonstrates well that neither blindly scaling your data nor blindly running clustering will yield good results. You will always need to try different variants and study them.
Upvotes: 2