EmJ

Reputation: 4608

How to use a precomputed distance matrix with the new version of KMeans in sklearn?

I am computing my own distance matrix as follows and I want to use it for clustering.

import numpy as np
from math import pi

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

#generate distance matrix from each point
dist = points_rad[None,:] - points_rad[:, None]

#Assign shortest distances from each point
dist[(dist > pi) & (dist <= 2 * pi)] -= 2 * pi
dist[(dist > -2 * pi) & (dist <= -pi)] += 2 * pi
dist = np.abs(dist)

#check dist
print(dist)

My distance matrix looks as follows.

[[0.         0.43633231 2.18166156 2.43909763 2.61799388]
 [0.43633231 0.         1.74532925 2.00276532 2.18166156]
 [2.18166156 1.74532925 0.         0.25743606 0.43633231]
 [2.43909763 2.00276532 0.25743606 0.         0.17889625]
 [2.61799388 2.18166156 0.43633231 0.17889625 0.        ]]

I want to have 2 clusters (e.g., cluster 1: 0,1 and cluster 2: 2,3,4) using k-means on the above precomputed distance matrix.

When I check the KMeans documentation, it seems that precomputed distances are deprecated -> precompute_distances='deprecated'.

Link to documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

What other options can I look into to perform k-means using my precomputed distance matrix?

I am happy to provide more details if needed.

Upvotes: 2

Views: 6407

Answers (2)

Michael Green

Reputation: 810

Do you really need your own distance matrix if you are going to feed the results to sklearn anyway? If not, you can run KMeans on your data directly by reshaping the points array to shape (-1, 1). NumPy treats -1 as a placeholder and infers that dimension from the length of the original array.

import numpy as np
from math import pi
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

lbls = KMeans(n_clusters=2).fit_predict(points_rad.reshape((-1,1)))
print(lbls)  # e.g. [0 0 1 1 1]; the label numbering may be swapped between runs

fig, ax = plt.subplots()

ax.scatter(points_rad, points_rad, c=lbls)

plt.show()
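
One caveat with the reshape approach: a plain 1-D array does not know that times wrap around at midnight, which is what the distance matrix in the question accounts for. A possible workaround, sketched here as an assumption rather than something taken from the question, is to place each time on the unit circle and cluster the (cos, sin) coordinates. The straight-line distance between two points on the circle grows with the shorter arc between them, so nearby times stay nearby even across midnight. The n_init and random_state values below are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same time values (in minutes) as in the question
points = np.array([100, 200, 600, 659, 700])
angles = points / (24 * 60) * 2 * np.pi

# Embed each time on the unit circle so that times just before and
# just after midnight end up close together in the feature space.
xy = np.column_stack((np.cos(angles), np.sin(angles)))

# n_init and random_state are illustrative assumptions, not required values
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(xy)
print(labels)  # points 0,1 share one label and points 2,3,4 share the other
```

Note that the centroids found this way lie inside the circle rather than on it, so this is only an approximation of clustering with true circular distances.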


Upvotes: 1

Ben Reiniger

Reputation: 12602

k-means needs the distances from each point to the centroids ("means") of the clusters at each iteration, not the pairwise distances between points. So unlike e.g. k-nearest-neighbors, having this data precomputed won't help*. The deprecated precompute_distances parameter instead controlled whether all the point-to-center distances were computed up front or inside the loop; for details see PR11950. That PR made a performance enhancement that obviated the need for this parameter.

* Well, I could perhaps see a speedup if the data were put into a search structure like BallTree (again, see k-neighbors) so that not all the point-centroid distances needed to be computed. But it's not clear how much this would help, and I think it would only really be useful when k is quite large. At any rate, that's not being done here.
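
To make the point about centroid distances concrete, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) in plain NumPy. It is not sklearn's implementation, and the function name lloyd_kmeans and its parameters are hypothetical. The thing to notice is that the distances computed in each iteration are between points and the current centers, which move every round, so a fixed point-to-point matrix cannot stand in for them.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Illustrative k-means loop; a sketch, not sklearn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Point-to-centroid distances, shape (n_samples, k).
        # These change whenever the centers move.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

# Time values from the question, converted to radians
points = np.array([100, 200, 600, 659, 700], dtype=float)
points_rad = points / (24 * 60) * 2 * np.pi

labels, centers = lloyd_kmeans(points_rad.reshape(-1, 1), k=2)
print(labels)
print(centers.ravel())
```

The pairwise matrix from the question never appears here: the only distances used are to the cluster means, and those are recomputed on every pass.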

Upvotes: 4
