Giles

Reputation: 1687

Why does scikit-learn silhouette_score return an error for 1 cluster?

A single cluster (K = 1) can be a valid, even the best, fit when comparing different values of K in K-means clustering. However, silhouette_score in scikit-learn (v0.23.1) does not work with one cluster and raises an unexpected error.

Here is the code to reproduce:

import numpy as np
from sklearn.cluster import KMeans  # needed for KMeans below
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
labels = kmeans.predict(X)
print(labels)
print(silhouette_score(X, labels)) # 2 clusters works

kmeans = KMeans(n_clusters=1, random_state=0).fit(X)
labels = kmeans.predict(X)
print(labels)
print(silhouette_score(X, labels)) # 1 cluster gives ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

The correct value of silhouette score for 1 cluster should be zero according to this.

Am I doing something wrong here?

Upvotes: 3

Views: 3174

Answers (1)

mabergerx

Reputation: 1213

In the silhouette_score documentation, the score is defined in terms of the Silhouette Coefficient as follows:

Compute the mean Silhouette Coefficient of all samples.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
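To make the quoted formula concrete, here is a minimal pure-Python sketch that computes a, b, and (b - a) / max(a, b) by hand for each sample. The function name silhouette_manual is hypothetical and this is not scikit-learn's implementation; it assumes Euclidean distance and Python 3.8+ (for math.dist).

```python
from math import dist


def silhouette_manual(points, labels):
    """Mean of (b - a) / max(a, b) over all samples (illustrative sketch)."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # With a single cluster there is no "nearest other cluster", so b has no value.
    if len(clusters) < 2:
        raise ValueError("b is undefined: there is no other cluster")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumed convention: a sample alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster
        # (dist(p, p) == 0, so dividing by len(own) - 1 excludes the sample itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


X = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
print(silhouette_manual(X, [0, 0, 0, 1, 1, 1]))  # approximately 0.713
```

Passing a single label for every sample, for example [0] * 6, raises the ValueError, because the min() over other clusters would have nothing to range over.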

So the implementation relies on b, the mean distance between a sample and the nearest cluster that the sample is not a part of. With only one cluster there is no other cluster, so b is undefined and the coefficient cannot be computed. That is why scikit-learn requires at least 2 distinct labels.
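If you still want a number for K = 1, for instance while looping over several values of K, one possible workaround is a thin wrapper that returns 0.0 when there is only one label. The helper name silhouette_or_zero and the choice of 0.0 are assumptions taken from the convention mentioned in the question, not scikit-learn behavior.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_or_zero(X, labels):
    """Return 0.0 for a single cluster, else defer to silhouette_score.

    Hypothetical helper: the 0.0 fallback is a convention, not part of scikit-learn.
    """
    if len(np.unique(labels)) < 2:
        return 0.0
    # Other invalid cases (e.g. one label per sample) still raise ValueError.
    return silhouette_score(X, labels)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
print(silhouette_or_zero(X, np.zeros(6, dtype=int)))          # single cluster
print(silhouette_or_zero(X, np.array([0, 0, 0, 1, 1, 1])))    # two clusters
```

Keeping the fallback in your own code makes the convention explicit instead of hiding it inside the metric.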

Moreover, if the Silhouette score is defined to be 0 for 1 cluster, why would you want to calculate it in the first place?

Upvotes: 2
