Reputation: 13
I'm attempting to compute the Wemmert-Gancarski index for a given clustering solution in Python.
However, I'm having trouble computing the denominator of the $R(M)$ term (equation 1.2.26), since I can't find a way to calculate the minimum distance between an observation and the centroids of the other clusters.
$R(M)$ is the quotient between the distance from a point $M$ to the barycenter of the cluster it belongs to and the minimum distance from that same point to the barycenters of all the other clusters.
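As I understand the definition (with $G_k$ the barycenter of cluster $C_k$, $k(M)$ the cluster containing $M$, $n_k$ the size of $C_k$, $N$ the total number of points and $K$ the number of clusters):

$$R(M) = \frac{\lVert M - G_{k(M)} \rVert}{\min_{k' \neq k(M)} \lVert M - G_{k'} \rVert}, \qquad J_k = \max\!\left(0,\; 1 - \frac{1}{n_k}\sum_{M \in C_k} R(M)\right), \qquad WG = \frac{1}{N}\sum_{k=1}^{K} n_k\, J_k$$

Here is my attempt so far: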
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, pairwise_distances
from scipy.spatial.distance import pdist, cdist, squareform, euclidean
from sklearn.datasets import load_iris
iris = load_iris()
XIris = iris.data  # use all four features
yIris = iris.target
inter = []
intra = []
k = n_labels = len(np.unique(yIris))  # number of clusters (3 for the true Iris partition)
centroidsIn = np.zeros(shape=(k, XIris.shape[1]))
centroidsOut = np.zeros(shape=(k, XIris.shape[1]))
quotient = []
nk = np.bincount(yIris)  # cluster sizes
for i in range(n_labels):
    inCluster = XIris[yIris == i]
    outCluster = XIris[yIris != i]
    centroidsIn[i] = np.mean(inCluster, axis=0)    # barycenter of cluster i
    centroidsOut[i] = np.mean(outCluster, axis=0)  # barycenter of everything outside cluster i
    intra.append(cdist(inCluster, centroidsIn, 'euclidean'))
    inter.append(cdist(inCluster, centroidsOut, 'euclidean'))
quotient = np.divide(intra, inter)
print(quotient)
The true WG index for a k=3 clustering solution using Fisher's Iris dataset is 0.666.
Any tips are appreciated.
Upvotes: 1
Views: 120
Reputation: 28
I managed to compute the entire index, including the part I was originally stuck on.
It turned out that I didn't need to isolate the denominator at all; I just had to compute the distance matrix from every point to every cluster centroid.
Then, for each point, the smallest of those distances gives the numerator and the second-smallest gives the denominator.
After that, the computation of the WG index was straightforward.
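To make that concrete, here is a minimal sketch of the key step on the Iris data (the variable names here are my own, not from the final function):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# Barycenter of each cluster, in label order:
centroids = np.array([X[y == k].mean(axis=0) for k in np.unique(y)])

dists = cdist(X, centroids)                    # shape (n_samples, n_clusters)
two_smallest = np.sort(dists, axis=1)[:, :2]   # smallest and second-smallest distance per point
RM = two_smallest[:, 0] / two_smallest[:, 1]   # the quotient R(M) for every point

The full function I ended up with: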
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def wemmert_gancarski_index(X, labels, n_clusters=None, min_nc=None):
    """
    The Wemmert-Gancarski Index, a measure of compactness.

    The W-G index is built from the quotients of distances between the points and the barycenters
    of all of the clusters.
    If the mean quotient for a cluster is greater than :math:`1`, that cluster is ignored, so the
    result is a weighted mean over the remaining clusters.

    **Maximum value** indicates the optimal number of clusters.

    Parameters
    ----------
    X : array-like or dataframe, with **shape:** *(n_samples, n_features)*
        An array / dataframe of observations used to compute the W-G index.
    labels : array-like, with **shape:** *(n_samples,)*
        An array / list of labels represented by integers.
    n_clusters : int, optional
        The number of clusters to compute the index for.
    min_nc : int, optional
        The minimum number of clusters; only needed together with `n_clusters` when `labels`
        is a list of candidate labelings rather than a single array.

    Returns
    -------
    The Wemmert-Gancarski Index.
    """
    # Checking for valid inputs:
    def check(labels, n_clusters=None, min_nc=None):
        if n_clusters is None and isinstance(labels, (np.ndarray, pd.DataFrame)) and min_nc is None:
            return labels
        elif isinstance(labels, list) and n_clusters is not None and min_nc is not None:
            # get_labels is a helper from the class this function was lifted from; it picks the
            # label set corresponding to n_clusters out of a list of candidate labelings.
            use_labels = self.get_labels(labels, n_clusters=n_clusters, min_nc=min_nc, need="Single")
            return np.asarray(use_labels)
        else:
            raise ValueError("Please provide either an array of labels (without the other arguments) "
                             "or (a list of labels, K, min_nc)")

    use_labels = check(labels, n_clusters=n_clusters, min_nc=min_nc)

    # Distance between each point and each cluster's centroid, given a dataframe and labels:
    def dist_from_centroid(X, labels):
        centroids = centers2(X, labels)  # centers2 (not shown) returns the barycenter of each cluster
        return cdist(X, centroids, metric='euclidean')

    dists = dist_from_centroid(X, use_labels)

    intra, inter = [], []
    for row in dists:
        inter.append(sorted(row)[1])  # second-smallest distance
        intra.append(np.min(row))     # smallest distance

    # Quotient of distances for each point:
    RM = np.divide(intra, inter)

    # Split a flat vector into consecutive chunks of length chunk_size:
    def chunk(vec, chunk_size):
        return np.array([vec[i:i + chunk_size] for i in range(0, len(vec), chunk_size)])

    nk = len(np.unique(use_labels))  # number of clusters
    RMi = chunk(RM, nk)

    # 1 minus the mean of the quotients in each group:
    meanDiff = 1 - np.mean(RMi.transpose(), axis=1)

    # Only keep the values greater than 0:
    Jk = meanDiff[meanDiff > 0]

    # Weight by cluster size and normalise by the total number of points:
    WG = np.sum([i * j for i, j in zip(np.bincount(use_labels), Jk)]) / len(use_labels)
    return WG
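The only external piece is `centers2`, which isn't shown above. Assuming it simply returns the per-cluster means (that assumption is mine), a quick sanity check on Iris looks like this:

import numpy as np
from sklearn.datasets import load_iris

# Stand-in for centers2, assuming it just returns the barycenter of each cluster:
def centers2(X, labels):
    return np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])

X, y = load_iris(return_X_y=True)
print(wemmert_gancarski_index(X, y))  # should land near the 0.666 quoted in the question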
As it turns out, the answer just required looking at the quotient from a different perspective and everything else fell into place.
Upvotes: 1