Reputation: 771

Calculating medoid of a cluster (Python)

So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.

I'm employing a fractional distance metric in order to calculate distances:

where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1

I would then calculate the medoid as:

where S is the set of datapoints, and δ is the absolute value of the distance metric used above.

I've looked online to no avail trying to find implementations of medoid (even with other distance metrics, but most thing were specifically k-means or k-medoid which [I think] is relatively different from what I want.

Essentially this boils down to me being unable to translate the math into effective programming. Any help would or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:

I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
I know numpy has an argmin() function (documented here).
Extra points for increased efficiency without lack of accuracy (I'm trying not to brute force by calculating every single fractional distance metric (because the number of point pairs might lead to a factorial complexity...).

Upvotes: 4

Answers (6)

CpILL

Reputation: 6999

Would a naïve (but simple) approach be to just average the vectors (centroid) in the cluster and then find the vector in the cluster closes to this centroid?

import numpy as np
import faiss

vectors = np.array([...], dtype=np.float32)

# Using FAISS to build the vector index

vector_dimension = vectors.shape[1] # assume vectors
index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)
index.add(vectors)

# find the mediod, closest to centroid?

centroid = np.array(vectors).mean(axis=0)
faiss.normalize_L2(centroid)
distances, indices = index.search(centroid, 2)

medoid = vectors[indices[0][0]]

Upvotes: 0

Oleg Melnikov

Reputation: 3298

Here is an example of computing a medoid for a single cluster with Euclidean distance.

import numpy as np, pandas as pd, matplotlib.pyplot as plt
a, b, c, d = np.array([0,1]), np.array([1, 3]), np.array([4,2]), np.array([3, 1.5])
vCenroid = np.mean([a, b, c, d], axis=0)

def GetMedoid(vX):
  vMean = np.mean(vX, axis=0)                               # compute centroid
  return vX[np.argmin([sum((x - vMean)**2) for x in vX])]   # pick a point closest to centroid

vMedoid = GetMedoid([a, b, c, d])

print(f'centroid = {vCenroid}')
print(f'medoid = {vMedoid}')

df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);
plt.plot(vCenroid[0], vCenroid[1], 'ro', ms=10);   # plot centroid as red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20);     # plot medoid as red star

You can also use the following package to compute medoid for one or more clusters

!pip -q install scikit-learn-extra > log
from sklearn_extra.cluster import KMedoids
GetMedoid = lambda vX: KMedoids(n_clusters=1).fit(vX).cluster_centers_
GetMedoid([a, b, c, d])[0]

Upvotes: 1

Pablo Ruiz Ruiz

Reputation: 636

I would say that you just need to compute the median.
np.median(np.asarray(points), axis=0)

Your median is the point with the biggest centrality.
Note: if you are using distances different than Euclidean this doesn't hold.

Upvotes: -3

Yonatan Simson

Reputation: 2575

If you don't mind using brute force this might help:

def calc_medoid(X, Y, f=2):
    n = len(X)
    m = len(Y)
    dist_mat = np.zeros((m, n))
    # compute distance matrix
    for j in range(n):
        center = X[j, :]
        for i in range(m):
            if i != j:
                dist_mat[i, j] = np.linalg.norm(Y[i, :] - center, ord=f)

    medoid_id = np.argmin(dist_mat.sum(axis=0))  # sum over y

    return medoid_id, X[medoid_id, :]

Upvotes: 0

user3684314

Reputation: 771

So I've accepted the answer here, but I thought I'd provide my implementation if anyone else was trying to do something similar:

(1) This is the distance function:

def fractional(p_coord_array, q_coord_array):
  # f is an arbitrary value, but must be greater than zero and 
  # less than one. In this case, I used 3/10. I took advantage
  # of the difference of cubes in this case, so that I wouldn't
  # encounter an overflow error.

  a = np.sum(np.array(p_coord_array, dtype=np.float64))
  b = np.sum(np.array(q_coord_array, dtype=np.float64))
  a2 = np.sum(np.power(p_coord_array, 2))
  ab = np.sum(p_coord_array) * np.sum(q_coord_array)
  b2 = np.sum(np.power(p_coord_array, 2))
  diffab = a - b
  suma2abb2 = a2 + ab + b2

  temp_dist = abs(diffab * suma2abb2)
  temp_dist = np.power(temp_dist, 1./10)

  dist = np.power(temp_dist, 10./3)
  return dist

(2) The medoid function (if the length of the dataset was less than 6000 [if greater than that, I ran into overflow errors... I'm still working on that bit to be perfectly honest...]):

def medoid(dataset):

  point = []
  w = len(dataset)

  if(len(dataset) < 6000):
    h = len(dataset)
    dist_matrix = [[0 for x in range(w)] for y in range(h)]

    list_combinations = [(counter_1, counter_2, data_1, data_2) for counter_1, data_1 in enumerate(dataset) for counter_2, data_2 in enumerate(dataset) if counter_1 < counter_2]

    for counter_3, tuple in enumerate(list_combinations):
      temp_dist = fractional(tuple[2], tuple[3])
      dist_matrix[tuple[0]][tuple[1]] = abs(temp_dist)
      dist_matrix[tuple[1]][tuple[0]] = abs(temp_dist)

Any questions, feel free to comment!

Upvotes: 5

Has QUIT--Anony-Mousse

Reputation: 77474

compute pairwise distance matrix
compute column or row sum
argmin to find medoid index

i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar.

Upvotes: 18

Calculating medoid of a cluster (Python)

Answers (6)

Related Questions