Reputation: 393
So I was trying to use the Elbow curve to find the optimum value of 'K' (number of clusters) in K-Means clustering.
The clustering was done on the average vectors (using Word2Vec) of a text column in my dataset (1467 rows). But looking at my text data, I can clearly identify more than 3 groups into which it could be grouped.
I read that the reasoning is to pick a small value of k while keeping the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow Curve is? Also, is there something I'm missing?
Attaching the Elbow curve for reference. I also plotted it up to 70 clusters as an exploratory check.
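In case it helps, this is roughly the kind of loop used to produce such a curve (a minimal sketch; X here stands for the 1467 averaged Word2Vec vectors, one row per document):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: array of shape (1467, vector_size) with one averaged Word2Vec vector per row
ks = range(2, 31)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    sse.append(km.inertia_)  # SSE: sum of squared distances to the nearest centroid

plt.plot(list(ks), sse, "o-")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()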
Upvotes: 0
Views: 1643
Reputation: 599
The article "Stop using the Elbow Method" compares the Elbow method with the Silhouette score (the latter is considered more reliable).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
SEED = 42
def scatter_plot(X, y=None):
    fig, ax = plt.subplots(figsize=(7, 4))
    if y is None:
        ax.scatter(X[:, 0], X[:, 1], marker=".", s=10)
    else:
        ax.scatter(X[:, 0], X[:, 1], marker=".", s=10, c=y)
    ax.set_xlabel("x1", fontsize=14)
    ax.set_ylabel("x2", fontsize=14)
    ax.tick_params(axis='both', labelsize=8)
    ax.grid(False)
    plt.tight_layout()
    plt.show()
X, y = make_blobs(n_samples=2000, n_features=2, centers=5,
                  random_state=SEED)
scatter_plot(X)
N = 2
def train_kmeans(X):
    # candidate cluster counts: k = 2, 3, ..., 8
    ks = np.linspace(N, 8, 7, dtype=np.int64)
    inertias = []
    silhouettes = []
    kmeans_k = []
    for k in ks:
        kmeans = KMeans(n_clusters=k, random_state=SEED)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        silhouettes.append(silhouette_score(X, kmeans.labels_))
        kmeans_k.append(kmeans)
    return kmeans_k, inertias, silhouettes, ks
kmeans_k, inertias, silhouettes, ks = train_kmeans(X)
##print(kmeans_k, inertias, silhouettes, ks)
############### plot score
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(ks, silhouettes, "o-", color="grey", linewidth=2.5, markersize=5)
ax.set_xlabel("k", fontsize=14)
ax.set_ylabel("Silhouette", fontsize=14)
ax.tick_params(axis='both', labelsize=8)
ax.set_title("Silhouette Score", fontsize=18, fontweight="bold")
ax.grid(False)
plt.tight_layout()
plt.show()
#################### plot cluster regions & centroids for the best model
# pick the fitted model with the highest silhouette score
best_id = silhouettes.index(max(silhouettes))
kmeans = kmeans_k[best_id]
# Step size of the mesh. Decrease to increase the quality of the plot.
h = 0.02  # point spacing in the mesh over [x_min, x_max] x [y_min, y_max]
# Plot the cluster regions. For that, we assign a color to each point in the mesh.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Obtain labels for each point in the mesh, using the model with the best silhouette score.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=plt.cm.Paired,
    aspect="auto",
    origin="lower",
)
plt.plot(X[:, 0], X[:, 1], "k.", markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    s=169,
    linewidths=3,
    color="w",
    zorder=10,
)
plt.title(
    "K-means clustering\n"
    "Centroids are marked with a white cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
P.S. MeanShift also does not require the number of clusters k as a parameter for modelling; you only give it a bandwidth. (The quantile used when estimating the bandwidth should be in [0, 1]; e.g. 0.5 means the median of all pairwise distances is used.)
from sklearn.cluster import MeanShift, estimate_bandwidth
# The bandwidth can be estimated automatically from the data
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=2000)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
Upvotes: 0
Reputation: 77454
The "elbow" is not even well defined so how can it be reliable?
You can "normalize" the values by the expected dropoff from splitting the data into k clusters and it will become a bit more readable. For example, the Calinski and Harabasz (1974) variance ratio criterion. It is essentially a rescaled version that makes much more sense.
Upvotes: 2