gimba

Reputation: 511

python kmeans clustering real data centroids

I noticed that sklearn's KMeans uses "imaginary" points, i.e. points that are not part of the data set, as cluster centroids.

So far, I found no option to use real data points as centroids in sklearn.

I am currently calculating the data point that is closest to a centroid but thought there might be an easier way.

I am not necessarily restricted to kmeans by the way.

A google search around clustering with real data centroids wasn't fruitful either.

Did anyone have the same problem before?

import numpy as np
from sklearn.cluster import KMeans
import math

def distance(a, b):
    dist = math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
    return dist

x = np.random.rand(10)
y = np.random.rand(10)

xy = np.array((x,y)).T

kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids  = kmeans.cluster_centers_

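# Check whether the first centroid coincides with any data point
print(np.where(xy == centroids[0])[0])

for c in centroids:
    # Use p (not x) so the lambda does not shadow the x array defined above
    nearest = min(xy, key=lambda p: distance(p, c))
    print('centroid', c)
    print('nearest data point to centroid', nearest)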

Upvotes: 1

Views: 4660

Answers (4)

GuyIncognito

Reputation: 11

After three years, this question remains unanswered. If anyone finds themselves in the same situation, what you are looking for is the k-medoids algorithm. It is implemented in scikit-learn-extra, a companion package to scikit-learn that is installed separately; just make sure to use from sklearn_extra.cluster import KMedoids instead of from sklearn.cluster import KMeans.
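
To illustrate what k-medoids does without installing an extra package, here is a minimal NumPy sketch. The helper name simple_kmedoids and its parameters are hypothetical, and the naive alternating scheme shown here is only an assumption for illustration, not necessarily what the library implements:

```python
import numpy as np

def simple_kmedoids(X, n_clusters, n_iter=100, seed=0):
    """Hypothetical helper: naive alternating k-medoids with Euclidean distances."""
    rng = np.random.default_rng(seed)
    # Full pairwise distance matrix; fine for small data, O(n^2) memory.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from randomly chosen, distinct data points.
    medoids = rng.choice(len(X), size=n_clusters, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue
            # Update step: the new medoid is the member with the smallest
            # total distance to the other members of its cluster.
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

rng = np.random.default_rng(0)
xy = rng.random((10, 2))
medoid_idx, labels = simple_kmedoids(xy, n_clusters=2)
print('medoid indices', medoid_idx)
print('medoids (real data points)', xy[medoid_idx])
```

Every returned center is a row of xy, which is the property asked for in the question.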

Upvotes: 1

cyril

Reputation: 609

I ran into the same question: how to find the sample within each cluster that minimizes inertia. I wrote this function:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked


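def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        # Indices (into X) of the samples assigned to cluster k
        mask = (km.labels_ == k).nonzero()[0]
        s = []
        # For each sample, sum the squared distances to all samples in the cluster
        for chunk in pairwise_distances_chunked(X=X[mask]):
            s.append(np.square(chunk).sum(axis=1))
        ret.append(mask[np.argmin(np.concatenate(s))])
    return np.array(ret)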

And it can be used like this :

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)

km = KMeans(n_clusters=3, random_state=0).fit(X)
index_representative_points(km, X)
>>> array([89, 25, 28], dtype=int64)

EDIT: For very large datasets, this function is very slow. However, it can be proven that the point within a cluster that minimizes the inertia is the one closest to the centroid. Hence this second version:

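from sklearn.metrics import pairwise_distances_argmin


def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        # Indices (into X) of the samples assigned to cluster k
        mask = (km.labels_ == k).nonzero()[0]
        centroid = np.mean(X[mask], axis=0)
        # Map the position of the closest sample within the cluster back to an index into X
        i0 = mask[pairwise_distances_argmin(centroid[None, :], X[mask])[0]]
        ret.append(i0)
    return np.array(ret)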
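
For completeness, here is a short sketch of why the closest-to-centroid claim holds. The symbols are my own notation, not from the answer: x_1, ..., x_n are the samples of one cluster, \mu is their mean, and p is any candidate point.

```latex
\begin{align*}
\sum_{i=1}^{n} \lVert x_i - p \rVert^2
  &= \sum_{i=1}^{n} \lVert (x_i - \mu) + (\mu - p) \rVert^2 \\
  &= \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2
     + 2\,(\mu - p)^{\top} \sum_{i=1}^{n} (x_i - \mu)
     + n\,\lVert \mu - p \rVert^2 \\
  &= \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2 + n\,\lVert \mu - p \rVert^2 ,
\end{align*}
```

The cross term vanishes because \sum_i (x_i - \mu) = 0 by definition of the mean. The first term does not depend on p, so minimizing the sum of squared distances over the cluster's samples is the same as minimizing \lVert \mu - p \rVert, i.e. picking the sample closest to the centroid.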

Upvotes: 0

FBruzzesi

Reputation: 6505

Centroids do not have to be points in your set. Since you are in a 2D space, the centroids have 2D coordinates. If you want to print the distances between each centroid and each point, you can do:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

x = np.random.rand(10)
y = np.random.rand(10)

xy = np.array((x,y)).T

kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids  = kmeans.cluster_centers_

for centroid in centroids:
    print(f'List of distances between centroid {centroid} and each point:\n\
          {np.linalg.norm(centroid-xy, axis=1)}\n')

List of distances between centroid [0.87236496 0.74034618] and each point:
          [0.21056113 0.84946149 0.83381298 0.31347176 0.40811323 0.85442416
 0.44043437 0.66736601 0.55282619 0.14813826]

List of distances between centroid [0.37243631 0.37851987] and each point:
          [0.77005698 0.29192851 0.25249753 0.60881231 0.2219568  0.24264077
 0.27374379 0.39968813 0.31728732 0.58604271]

As you can see, the prediction for each point corresponds to the centroid to which its distance is minimal:

kmeans.predict(xy)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])


distances = np.vstack([np.linalg.norm(centroids[0]-xy, axis=1),
                     np.linalg.norm(centroids[1]-xy, axis=1)])
distances.argmin(axis=0)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
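
The same kind of distance matrix can also answer the original question, by taking the argmin along the other axis to get the nearest real data point for each centroid. This is a minimal sketch: the helper name nearest_point_indices is hypothetical and the centroids below are made-up values. In practice you would presumably pass kmeans.cluster_centers_ instead.

```python
import numpy as np

def nearest_point_indices(points, centroids):
    """Hypothetical helper: index of the data point closest to each centroid."""
    # distances[i, j] = Euclidean distance from centroid i to point j
    distances = np.linalg.norm(centroids[:, None, :] - points[None, :, :], axis=2)
    return distances.argmin(axis=1)

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.3, 0.3], [5.6, 5.0]])  # made-up centers for illustration

idx = nearest_point_indices(points, centroids)
print(idx)          # prints [0 4]
print(points[idx])  # the real data points nearest to each centroid
```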

Let's plot the data: centroids are drawn as squares and points as circles whose size is inversely proportional to the distance from their centroid.

Although the figure plots other random data points, I hope it helps.

[Figure: scatter plot of the data points (circles) and centroids (squares)]

Upvotes: 1

Poe Dator

Reputation: 4912

Actually, sklearn.cluster.KMeans now allows you to pass custom initial centroids. See the init section here https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html or the source code for sklearn.cluster.KMeans here: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/cluster/_kmeans.py#L649

"If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers."

I hope that it works. Please try.
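
One caveat worth checking: centers passed this way are only starting values. The sketch below does not call sklearn; it performs a single k-means (Lloyd) iteration by hand in NumPy, on made-up data, to show that centers initialized at real data points are replaced by cluster means after the first update. With sklearn itself, the equivalent starting point would presumably be KMeans(n_clusters=2, init=init_centers, n_init=1).

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
init_centers = points[[0, 3]]  # two real data points used as starting centers

# Assignment step: label each point with its nearest center.
d = np.linalg.norm(points[:, None, :] - init_centers[None, :, :], axis=2)
labels = d.argmin(axis=1)

# Update step: each center becomes the mean of the points assigned to it.
new_centers = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(init_centers))]
)
print(new_centers)  # approximately [[0.333 0.333], [5.5 5.0]]
```

After one iteration neither center is a data point any more, so initializing with real points does not by itself keep the final centroids on real points.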

Upvotes: 1
