Parvathy Sarat

Reputation: 393

How reliable is the Elbow curve in finding K in K-Means?

So I was trying to use the Elbow curve to find the value of optimum 'K' (number of clusters) in K-Means clustering.

The clustering was done on the average vectors (computed with Word2Vec) of a text column in my dataset (1467 rows). But looking at the text data, I can clearly see more than 3 groups it could be split into.

I read that the reasoning is to choose a small value of k that still keeps the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow curve is, and whether there's something I'm missing?

Attaching the Elbow curve for reference. I also tried plotting it up to 70 clusters as an exploratory check.

[Elbow curve: SSE vs. number of clusters]
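For context, the curve comes from a loop of this shape (a minimal sketch, assuming X is the 1467-row matrix of averaged Word2Vec vectors; the names are illustrative):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(2, 71)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # within-cluster sum of squared errors

plt.plot(ks, sse, "o-")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()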

Upvotes: 0

Views: 1643

Answers (2)

JeeyCi

Reputation: 599

The article Stop using the Elbow Method compares the Elbow method with the Silhouette score (the latter is considered more reliable):

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

SEED = 42


def scatter_plot(X, y=None):
    # Scatter the two feature columns, optionally colored by cluster label.
    fig, ax = plt.subplots(figsize=(7, 4))

    if y is None:
        ax.scatter(X[:, 0], X[:, 1], marker=".", s=10)
    else:
        ax.scatter(X[:, 0], X[:, 1], marker=".", s=10, c=y)

    ax.set_xlabel("x1", fontsize=14)
    ax.set_ylabel("x2", fontsize=14)
    ax.tick_params(axis='both', labelsize=8)

    ax.grid(False)
    plt.tight_layout()
    plt.show()

X, y = make_blobs(n_samples=2000, n_features=2, centers=5,
                  random_state=SEED)

scatter_plot(X)

[Scatter plot of the five generated blobs]

N = 2  # silhouette_score is undefined for a single cluster, so start at k = 2

def train_kmeans(X):
    # Fit a KMeans model for each candidate k and record its inertia
    # (elbow criterion) and silhouette score.
    ks = np.arange(N, 9)  # k = 2, 3, ..., 8
    inertias = []
    silhouettes = []
    kmeans_k = []
    for k in ks:
        kmeans = KMeans(n_clusters=k, random_state=SEED)
        kmeans.fit(X)

        inertias.append(kmeans.inertia_)
        silhouettes.append(silhouette_score(X, kmeans.labels_))
        kmeans_k.append(kmeans)

    return kmeans_k, inertias, silhouettes, ks

kmeans_k, inertias, silhouettes, ks = train_kmeans(X)

# Plot the silhouette score for each candidate k
fig, ax = plt.subplots(figsize=(7, 4))

ax.plot(ks, silhouettes, "o-", color="grey", linewidth=2.5, markersize=5)

ax.set_xlabel("k", fontsize=14)
ax.set_ylabel("Silhouette", fontsize=14)
ax.tick_params(axis='both', labelsize=8)

ax.set_title("Silhouette Score", fontsize=18, fontweight="bold")
ax.grid(False)

plt.tight_layout()
plt.show()

[Silhouette score vs. k]

# Plot the decision regions and centroids of the model with the
# highest silhouette score
best = silhouettes.index(max(silhouettes))
kmeans = kmeans_k[best]

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = 0.02  # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use BEST acc. silhouettes SCORE trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=plt.cm.Paired,
    aspect="auto",
    origin="lower",
)

plt.plot(X[:, 0], X[:, 1], "k.", markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    s=169,
    linewidths=3,
    color="w",
    zorder=10,
)
plt.title(
    "K-means clustering \n"
    "Centroids are marked with white cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

[K-means decision regions on the blob data; centroids marked with a white cross]

P.S. MeanShift also does not need the number of clusters k as a modelling parameter; it only needs a bandwidth. estimate_bandwidth can derive one from the data via its quantile parameter, which should be in [0, 1] (e.g. 0.5 means the median of all pairwise distances is used):

from sklearn.cluster import MeanShift, estimate_bandwidth

# The bandwidth can be estimated automatically from the data
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=2000)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77454

The "elbow" is not even well defined so how can it be reliable?

You can "normalize" the values by the expected dropoff from splitting the data into k clusters and it will become a bit more readable. For example, the Calinski and Harabasz (1974) variance ratio criterion. It is essentially a rescaled version that makes much more sense.

Upvotes: 2
