N3uralN3twork

Reputation: 13

How to compute the Wemmert-Gancarski Index in Python?

Problem:

I'm attempting to compute the Wemmert-Gancarski index for a given clustering solution in Python.

However, I am having trouble calculating the denominator of the $R(M)$ part of the index (1.2.26), since I can't find a way to compute the minimum distance between an observation and the centroids of the other clusters.

$R(M)$ is the quotient between the distance from a point $M$ to the barycenter of the cluster it belongs to and the minimum distance from $M$ to the barycenters of all the other clusters.
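Written out (with $G_k$ the barycenter of cluster $C_k$, $n_k$ its size, $N$ the total number of points, and $k(M)$ the index of the cluster containing $M$), my understanding of the definition is:

$$R(M) = \frac{\lVert M - G_{k(M)} \rVert}{\min\limits_{k' \neq k(M)} \lVert M - G_{k'} \rVert}, \qquad WG = \frac{1}{N} \sum_{k=1}^{K} n_k \max\!\left(0,\; 1 - \frac{1}{n_k} \sum_{M \in C_k} R(M)\right)$$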

My Attempt:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris

iris = load_iris()
XIris = iris.data  # all four features
yIris = iris.target

k = 3  # number of clusters
n_labels = len(np.unique(yIris))

inter = []
intra = []
centroidsIn = np.zeros((k, XIris.shape[1]))
centroidsOut = np.zeros((k, XIris.shape[1]))
nk = np.bincount(yIris)  # size of each cluster

for i in range(n_labels):
    inCluster = XIris[yIris == i]    # points belonging to cluster i
    outCluster = XIris[yIris != i]   # points belonging to every other cluster
    centroidsIn[i] = np.mean(inCluster, axis=0)
    centroidsOut[i] = np.mean(outCluster, axis=0)
    intra.append(cdist(inCluster, centroidsIn, 'euclidean'))
    inter.append(cdist(inCluster, centroidsOut, 'euclidean'))

quotient = np.divide(intra, inter)
print(quotient)

The true WG index for a k=3 clustering solution using Fisher's Iris dataset is 0.666.

Any tips are appreciated.

Upvotes: 1

Views: 120

Answers (1)

N3uralN3twork

Reputation: 28

Update:

I managed to compute the entire index, including the original problem above.

It turned out that I didn't need to isolate the denominator itself; I just had to calculate the distance matrix from each point to every cluster centroid.

Then, for each point, the smallest distance gives the numerator of $R(M)$ and the second-smallest gives the denominator (this assumes each point lies closest to the centroid of its own cluster, which holds for this clustering).
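As a minimal sketch of that step (the points and centroids below are hypothetical toy values, not the Iris data), np.partition pulls the two smallest distances out of each row of the distance matrix:

import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy data: 4 points scattered around 2 centroids.
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [2.9, 3.1]])
centroids = np.array([[0.05, 0.1], [2.95, 3.05]])

dists = cdist(X, centroids)  # (n_samples, n_clusters) distance matrix
two_smallest = np.partition(dists, kth=1, axis=1)
intra = two_smallest[:, 0]   # nearest centroid: the numerator of R(M)
inter = two_smallest[:, 1]   # second-nearest centroid: the denominator of R(M)
print(intra / inter)         # R(M) for each point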

After that, the computation of the WG index was straightforward.

Supporting Code:

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def wemmert_gancarski_index(X, labels, n_clusters=None, min_nc=None):
    """
    The Wemmert-Gancarski Index, a measure of cluster compactness and separation.

    The W-G index is built from the quotients of the distances between the points
    and the barycenters of all of the clusters.

    If the mean quotient for a cluster is greater than :math:`1`, that cluster
    contributes :math:`0`, so the index is a weighted mean.

    **Maximum value** indicates the optimal number of clusters.

    Parameters
    ----------
    X : array-like or dataframe, with **shape:** *(n_samples, n_features)*
        An array / dataframe of observations used to compute the W-G index.

    labels : array-like, with **shape:** *(n_samples,)*
        An array / list of labels represented by integers.

    n_clusters : int, optional
        The number of clusters to compute the index for.

    Returns
    -------
    The Wemmert-Gancarski Index.
    """
    # Checking for valid inputs:
    def check(labels, n_clusters=None, min_nc=None):
        if isinstance(labels, (list, np.ndarray, pd.Series, pd.DataFrame)):
            return np.asarray(labels).ravel()
        raise ValueError("Please provide an array-like of integer labels "
                         "with shape (n_samples,)")

    use_labels = check(labels, n_clusters=n_clusters, min_nc=min_nc)

    # Calculate the distance between each point and each cluster's centroid,
    # given the data and labels:
    def dist_from_centroid(X, labels):
        # Barycenter (mean) of each cluster, in sorted label order:
        centroids = np.array([np.asarray(X)[labels == k].mean(axis=0)
                              for k in np.unique(labels)])
        # Distance from every point to every centroid:
        return cdist(X, centroids, metric='euclidean')

    dists = dist_from_centroid(X, use_labels)

    intra, inter = [], []

    # The smallest distance in each row is the distance to the point's own centroid
    # (assuming every point lies closest to its own cluster) and the second-smallest
    # is the distance to the nearest other centroid:
    for row in dists:
        intra.append(np.min(row))      # smallest distance: numerator of R(M)
        inter.append(sorted(row)[1])   # second-smallest: denominator of R(M)

    # Quotient of the two distances for every point:
    RM = np.divide(intra, inter)

    # Group the quotients by cluster and compute J_k = max(0, 1 - mean(R(M))) for
    # each cluster, so that clusters whose mean quotient exceeds 1 contribute 0:
    cluster_ids = np.unique(use_labels)
    sizes = np.array([np.sum(use_labels == k) for k in cluster_ids])
    Jk = np.array([max(0.0, 1 - np.mean(RM[use_labels == k])) for k in cluster_ids])

    # The index is the mean of the J_k weighted by cluster size:
    WG = np.sum(sizes * Jk) / len(use_labels)

    return WG
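A quick sanity check against the value quoted in the question (using the Iris species labels as the k=3 partition; the result should land near 0.666):

from sklearn.datasets import load_iris

iris = load_iris()
print(wemmert_gancarski_index(iris.data, iris.target))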

As it turns out, the answer just required looking at the quotient from a different perspective and everything else fell into place.

Upvotes: 1
