How to use precomputed distance matrix in new version of kmeans in sklearn?

Question

I am computing my own distance matrix as follows and I want to use it for clustering.

import numpy as np
from math import pi

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

#generate distance matrix from each point
dist = points_rad[None,:] - points_rad[:, None]

#Assign shortest distances from each point
dist[((dist > pi) & (dist <= (2*pi)))] = dist[((dist > pi) & (dist <= (2*pi)))] -(2*pi)
dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] = dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] + (2*pi) 
dist = abs(dist)

#check dist
print(dist)

My distance matrix looks as follows.

[[0.         0.43633231 2.18166156 2.43909763 2.61799388]
 [0.43633231 0.         1.74532925 2.00276532 2.18166156]
 [2.18166156 1.74532925 0.         0.25743606 0.43633231]
 [2.43909763 2.00276532 0.25743606 0.         0.17889625]
 [2.61799388 2.18166156 0.43633231 0.17889625 0.        ]]
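For reference, the same matrix can probably be produced more compactly. This is a minimal sketch, assuming the goal is the shorter arc between two angles on a 24-hour circle; it relies only on NumPy broadcasting and `np.minimum`, and the variable names `diff` and `dist` are chosen here for illustration.

```python
import numpy as np

# Times of day in minutes, as in the question
points = np.array([100, 200, 600, 659, 700])

# Convert minutes to angles in radians (one full day = 2*pi)
points_rad = points / (24 * 60) * 2 * np.pi

# Absolute angular difference between every pair, in [0, 2*pi)
diff = np.abs(points_rad[None, :] - points_rad[:, None])

# On a circle, the shorter way around is either diff or 2*pi - diff
dist = np.minimum(diff, 2 * np.pi - diff)

print(np.round(dist, 8))
```

Taking the element-wise minimum replaces the two masked assignments and the final `abs` call, and the result stays symmetric with a zero diagonal.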

I want to get 2 clusters (e.g., cluster 1: points 0, 1 and cluster 2: points 2, 3, 4) using kmeans on the above precomputed distance matrix.

When I check the kmeans documentation, it seems like precomputed distances are deprecated -> precompute_distances='deprecated'.

Link to documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

I am wondering what other options I can look into to perform kmeans using my precomputed distance matrix.

I am happy to provide more details if needed.

Michael Green · Accepted Answer

Do you really need your own distance matrix for clustering if you're going to end up feeding the data to sklearn anyway? If not, you can run KMeans on your dataset directly by reshaping your points array to shape (-1, 1). In reshape, -1 tells numpy to infer that dimension from the length of the array, so the five values become a 5x1 column, which is the 2-D input KMeans expects.

import numpy as np
from math import pi
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

lbls = KMeans(n_clusters=2).fit_predict(points_rad.reshape((-1,1)))
print(lbls) # e.g. [0 0 1 1 1]; the two label numbers may be swapped between runs

fig, ax = plt.subplots()

ax.scatter(points_rad, points_rad, c=lbls)

plt.show()
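One caveat: running KMeans on the raw radian values treats time as a straight line, so times just before and just after midnight would land far apart. A possible workaround, sketched here as an assumption rather than as part of the original answer, is to place each time on the unit circle with cos/sin and cluster those 2-D points. The Euclidean distance between two such points is the chord length, which grows with the circular distance but is not identical to it. The names `xy` and `labels` are illustrative, and `n_init` and `random_state` are set only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Times of day in minutes, as in the question
points = np.array([100, 200, 600, 659, 700])

# Convert minutes to angles in radians (one full day = 2*pi)
points_rad = points / (24 * 60) * 2 * np.pi

# Map each angle to a 2-D point on the unit circle so that
# times on either side of midnight end up next to each other
xy = np.column_stack((np.cos(points_rad), np.sin(points_rad)))

# Cluster the embedded points; KMeans works on feature vectors,
# not on a user-supplied distance matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(xy)
print(labels)
```

For this data the first two points should share one label and the last three the other; which group is numbered 0 is arbitrary.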
