Reputation: 511
As far as I can tell, sklearn's KMeans uses imaginary points (the cluster means) as centroids.
So far, I have found no option in sklearn to use real data points as centroids.
I am currently calculating the data point that is closest to each centroid, but thought there might be an easier way.
I am not necessarily restricted to k-means, by the way.
A Google search for clustering with real data points as centroids wasn't fruitful either.
Has anyone had the same problem before?
import math

import numpy as np
from sklearn.cluster import KMeans


def distance(a, b):
    # Euclidean distance between two 2-D points
    return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)


x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x, y)).T
kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_
# Check whether the first centroid coincides with any data point (usually it does not)
print(np.where(xy == centroids[0])[0])
for c in centroids:
    # Current workaround: pick the data point nearest to each centroid
    nearest = min(xy, key=lambda p: distance(p, c))
    print('centroid', c)
    print('nearest data point to centroid', nearest)
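A vectorized version of the loop above, as a minimal sketch: scikit-learn's pairwise_distances_argmin returns, for each row of its first argument, the index of the closest row of its second argument. The fixed seed and the names rng, nearest_idx and nearest_points are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
xy = rng.random((10, 2))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(xy)
centroids = kmeans.cluster_centers_

# For each centroid, the index of the nearest row of xy (Euclidean distance).
nearest_idx = pairwise_distances_argmin(centroids, xy)
nearest_points = xy[nearest_idx]

for c, i in zip(centroids, nearest_idx):
    print('centroid', c, '-> nearest data point', xy[i], 'at index', i)
```

This gives the same points as the min(...) loop, but also keeps the row indices, which is handy when the rows of xy map back to other metadata.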
Upvotes: 1
Views: 4660
Reputation: 11
After three years, this question remains unanswered. If anyone finds themselves in the same situation, what you are looking for is the k-medoids algorithm. It is not part of scikit-learn itself but of the companion package scikit-learn-extra, so make sure to use from sklearn_extra.cluster import KMedoids instead of from sklearn.cluster import KMeans.
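To illustrate what k-medoids does differently, here is a simplified NumPy-only sketch of the alternating variant (assign points to the nearest medoid, then pick the best member of each cluster as the new medoid). It is not the scikit-learn-extra implementation; the function name k_medoids, its parameters, and the random data are hypothetical.

```python
import numpy as np


def k_medoids(X, n_clusters, max_iter=100, seed=0):
    """Simplified alternating k-medoids; returns (medoid_indices, labels)."""
    rng = np.random.default_rng(seed)
    # Full pairwise Euclidean distance matrix, shape (n_samples, n_samples).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the old medoid if a cluster empties out
            # Update step: the member with the smallest total distance
            # to the other members becomes the new medoid.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged: no medoid changed
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels


rng = np.random.default_rng(1)
X = rng.random((30, 2))
medoid_idx, labels = k_medoids(X, n_clusters=3)
print(X[medoid_idx])  # every cluster centre is an actual row of X
```

Because each centre is stored as an index into X, the result is a real data point by construction. The full distance matrix makes this sketch quadratic in memory, so it only suits small datasets.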
Upvotes: 1
Reputation: 609
I ran into the same question: how to find the sample within each cluster that minimizes inertia. I wrote this function:
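import numpy as np
from sklearn.metrics import pairwise_distances_chunked


def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        # Indices of the samples assigned to cluster k
        mask = (km.labels_ == k).nonzero()[0]
        s = []
        # Distances within the cluster, computed chunk by chunk to limit memory use
        for chunk in pairwise_distances_chunked(X=X[mask]):
            # Sum of squared distances from each sample to all others in the cluster
            s.append(np.square(chunk).sum(axis=1))
        ret.append(mask[np.argmin(np.concatenate(s))])
    return np.array(ret)

It can be used like this: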
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)
index_representative_points(km, X)
>>> array([89, 25, 28], dtype=int64)
EDIT: For very large datasets, this function is very slow. However, it can be proven that the point within a cluster that minimizes the inertia is the one closest to the centroid. Hence this second version:
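import numpy as np
from sklearn.metrics import pairwise_distances_argmin


def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        # Indices of the samples assigned to cluster k
        mask = (km.labels_ == k).nonzero()[0]
        centroid = np.mean(X[mask], axis=0)
        # Index (into X) of the cluster member closest to the centroid
        i0 = mask[pairwise_distances_argmin(centroid[None, :], X[mask])[0]]
        ret.append(i0)
    return np.array(ret)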
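The claim behind the faster version can be checked with a short derivation. The notation is an assumption for illustration: a cluster with points $x_1, \dots, x_n$ and centroid $\mu = \frac{1}{n}\sum_j x_j$.

```latex
\begin{aligned}
\sum_{j=1}^{n} \lVert x_i - x_j \rVert^2
  &= \sum_{j=1}^{n} \lVert (x_i - \mu) - (x_j - \mu) \rVert^2 \\
  &= n \lVert x_i - \mu \rVert^2
     - 2\,(x_i - \mu)^{\top} \sum_{j=1}^{n} (x_j - \mu)
     + \sum_{j=1}^{n} \lVert x_j - \mu \rVert^2 \\
  &= n \lVert x_i - \mu \rVert^2 + \sum_{j=1}^{n} \lVert x_j - \mu \rVert^2 ,
\end{aligned}
```

The cross term vanishes because $\sum_j (x_j - \mu) = 0$ by definition of the centroid. The last sum does not depend on $i$, so the sample that minimizes the sum of squared distances to the other cluster members is exactly the one that minimizes $\lVert x_i - \mu \rVert$.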
Upvotes: 0
Reputation: 6505
Centroids do not have to be points in your set. Since you are in a 2-D space, you will get centroids with 2-D coordinates. If you want to print the distances between each centroid and each point, you can do:
import numpy as np
from sklearn.cluster import KMeans

x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x, y)).T
kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_
for centroid in centroids:
    # Euclidean distance from this centroid to every point
    print(f'List of distances between centroid {centroid} and each point:\n'
          f'{np.linalg.norm(centroid - xy, axis=1)}\n')
List of distances between centroid [0.87236496 0.74034618] and each point:
[0.21056113 0.84946149 0.83381298 0.31347176 0.40811323 0.85442416
0.44043437 0.66736601 0.55282619 0.14813826]
List of distances between centroid [0.37243631 0.37851987] and each point:
[0.77005698 0.29192851 0.25249753 0.60881231 0.2219568 0.24264077
0.27374379 0.39968813 0.31728732 0.58604271]
As you can see, the predicted label of each point corresponds to the centroid at minimal distance:
kmeans.predict(xy)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
distances = np.vstack([np.linalg.norm(centroids[0] - xy, axis=1),
                       np.linalg.norm(centroids[1] - xy, axis=1)])
distances.argmin(axis=0)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
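The same check can be written without stacking the distances by hand. As a small sketch, KMeans.transform returns the distance from every sample to every centroid; the seed and the name labels_from_distances are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
xy = rng.random((10, 2))
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(xy)

# transform() maps samples to cluster-distance space:
# shape (n_samples, n_clusters), one column per centroid.
distances = kmeans.transform(xy)
labels_from_distances = distances.argmin(axis=1)

# Compare with the labels returned by predict().
print(np.array_equal(labels_from_distances, kmeans.predict(xy)))
```

Note that this matrix has one row per sample, so the argmin runs over axis=1 rather than axis=0 as in the stacked version.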
Let's plot the data: centroids are drawn as squares and points as circles whose size is inversely proportional to the distance from their centroid.
Although the figure plots other random data points, I hope it helps.
Upvotes: 1
Reputation: 4912
Actually, sklearn.cluster.KMeans now allows you to pass custom initial centroids.
See the init section here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
or the source code for sklearn.cluster.KMeans here: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/cluster/_kmeans.py#L649
"If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers."
I hope that it works. Please try.
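A minimal sketch of what passing an array to init looks like. The choice of the first two rows as starting centers and the fixed seed are arbitrary assumptions; n_init=1 is set because a single explicit initialization leaves nothing to restart.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
xy = rng.random((10, 2))

# Two real data points used as starting centers (arbitrary choice).
init_centers = xy[[0, 1]]
kmeans = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(xy)

print('initial centers:\n', init_centers)
print('final centers:\n', kmeans.cluster_centers_)
```

Keep in mind that init only sets where the iterations start. The fitted cluster_centers_ are still cluster means, so they will generally not coincide with data points.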
Upvotes: 1