Reputation: 31

Agglomerative clustering from custom pairwise distance function

thanks to insightful comments on this thread: Pairwise Wasserstein distance on 2 arrays, I was able to come up with a custom function to find a distance metric between a set of 2 dimensional arrays (10 points, with x-, y-coordinates). My next step is to find a way to feed this information into an agglomerative clustering algorithm, such as the fcluster() method of the scipy.cluster.hierarchy module.

More specifically, I would like to use the following functions to find ideally a set of n clusters for a 3 dimensional array of data. I am not sure how to tweak the pairwise-wasserstein function to retrieve the distance matrix that fcluster needs to find the cluster assignments agglomeratively.

Thanks for any ideas in advance!

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.cluster.hierarchy import dendrogram, linkage, ward
from scipy.cluster.hierarchy import fcluster

data = np.array([[[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8]],
                 [[1, 15], [3, 2], [1, 2], [5, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 1], [7, 8], [5, 6], [7, 1], [5, 6], [7, 8], [5, 1], [7, 8], [5, 6], [7, 8]]])


def wasserstein_distance_function(f1, f2):
    min_cost = np.inf
    f1 = f1.reshape((10, 2))
    f2 = f2.reshape((10, 2))
    for l in np.linspace(0.8, 1.2, 3):
        for k in np.linspace(0.8, 1.2, 3):
            cost = distance.cdist(l * f1, k * f2, 'sqeuclidean')
            row_ind, col_ind = linear_sum_assignment(cost)
            curr_cost = cost[row_ind, col_ind].sum()
            if curr_cost < min_cost:
                min_cost = curr_cost
    return min_cost

def pairwise_wasserstein(points):
    """
    Helper function to perform the pairwise distance function of all points within 'points' parameter
    """
    for first_index in range(0,points.shape[0]):
      for second_index in range(first_index+1,points.shape[0]):
        print("First index: ", first_index, ", Second index: ", second_index, ", Distance: ",wasserstein_distance_function(points[first_index],points[second_index]))

def find_clusters_formation(data):
    """
    Method to find the clusters for the points array
    """
    dist_mat = pairwise_wasserstein(data)
    Z = ward(dist_mat)
    cluster = fcluster(Z, 3, criterion='maxclust')

Upvotes: 2

Answers (2)

SkogensKonung

Reputation: 673

If you want to use a predefined metric, you have to create a distance matrix, which is a quadratic matrix that has 0s on the diagonal. Of course, the reason why it has zeros on its diagonal is: the distance of a point to itself is zero. This matrix is than passed as a parameter to the fit_predict function of a clustering algorithm.

First step - create a distance matrix and calculate the distance between data points:

distance_matrix = np.asarray([
    [wasserstein_distance_function(data[first_index], data[second_index]) 
         for first_index in range(len(data))] 
             for second_index in range(len(data))])

This prints the following:

array([[  0.  , 100.8 ,  76.4 ,  96.32],
       [100.8 ,   0.  , 215.  ,  55.68],
       [ 76.4 , 215.  ,   0.  , 186.88],
       [ 96.32,  55.68, 186.88,   0.  ]])

Second step - fill the clustering algorithm with paramaters as you wish:

clusterer = AgglomerativeClustering(n_clusters=3, affinity="precomputed", linkage="average", distance_threshold=None)

Third step - extract the labels:

clusterer.fit_predict(distance_matrix)

This prints:

array([2, 0, 1, 0], dtype=int64)

Does it achieve what you wanted?

Upvotes: 1

m1gnoc

Reputation: 31

Update:

I might have gotten it to work by fitting in a [1, 20] array of all 10 players x and y coordinates combined in the form: [x1, y1, x2, y2, ..., x10, y10] and then reshaped them as shown above in the wasserstein_distance_function.

I'm not 100% sure yet if that works, but the first results seem promising (i.e. decently balanced clusters).

Upvotes: 0

Agglomerative clustering from custom pairwise distance function

Answers (2)

Related Questions