Manually find the distance between centroid and labelled data points

Question

I have run a clustering analysis on some data X and obtained both the labels y and the centroids c. Now I'm trying to calculate the distance between each point in X and the centroid c of its assigned cluster. This is easy when we have a small number of points:

import numpy as np

# 10 random points in 3D space
X = np.random.rand(10,3)

# define the number of clusters, say 3
clusters = 3

# give each point a random label 
# (in the real code this is found using KMeans, for example)
y = np.random.randint(0, clusters, size=(10, 1))

# randomly assign location of centroids 
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)

# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i]-c[y[i][0]]))

Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.

v0rtex20k · Accepted Answer

Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:

distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

will do the same thing as your original for loop.
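To see why the mapping works: with y of shape (n, 1), c[y] has shape (n, 1, 3), and the einsum collapses the singleton middle axis to give (n, 3), one centroid per point. Here is a minimal sketch that checks this against the loop, assuming the same shapes as in the question. The seeded generator and the names labels, loop_distances, einsum_distances and ravel_distances are only for illustration, and the flattened-label variant is an alternative, not the method given above:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
n_points, n_clusters = 10, 3

X = rng.random((n_points, 3))                        # points, shape (n, 3)
y = rng.integers(0, n_clusters, size=(n_points, 1))  # labels, shape (n, 1) as in the question
c = rng.random((n_clusters, 3))                      # centroids, shape (k, 3)

# Reference: the explicit loop from the question
loop_distances = np.array(
    [np.linalg.norm(X[i] - c[y[i][0]]) for i in range(len(X))]
)

# One-liner from the answer: c[y] is (n, 1, 3); einsum sums out the middle axis
einsum_distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

# Alternative: flatten the labels first, so c[labels] is already (n, 3)
labels = y.ravel()
ravel_distances = np.linalg.norm(X - c[labels], axis=1)

print(np.allclose(loop_distances, einsum_distances))
print(np.allclose(loop_distances, ravel_distances))
```

If your labels are already one-dimensional (as scikit-learn's KMeans labels_ are), c[y] gives the (n, 3) array directly and neither the einsum nor the ravel should be needed.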

EDIT: Thanks @Kris, I forgot the axis keyword. Without it, numpy computes a single norm over the whole matrix (the Frobenius norm for a 2-D array) rather than one norm per row (axis 1). I've updated the answer, and it should now return an array with one distance per point. The einsum was suggested by @Kris for their specific application, where y has shape (n, 1).
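The effect of the axis argument is easy to check on a tiny example. This is a small sketch with hand-picked values (diff, total and per_row are illustrative names, not data from the question):

```python
import numpy as np

# Two "point minus centroid" difference vectors, one per row
diff = np.array([[3.0, 4.0, 0.0],
                 [0.0, 0.0, 2.0]])

# No axis: a single norm over every element, sqrt(9 + 16 + 4)
total = np.linalg.norm(diff)

# axis=1: one Euclidean norm per row
per_row = np.linalg.norm(diff, axis=1)

print(total)    # a single scalar
print(per_row)  # an array with one entry per row
```

Only the axis=1 version yields one distance per data point, which is what the original loop produces.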
