Adrian_G

Reputation: 163

Define cluster centers manually

When doing a k-means cluster analysis, how do I manually define the cluster centers? For example, I want to say my cluster centers are [1,2,3] and [3,4,5], and then assign my vectors to these predefined centers.

Something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]]?

To work around my problem, this is what I do at the moment:

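from sklearn.cluster import KMeans

# vec holds the sentence vectors; one cluster is requested per vector
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)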

It basically defines one cluster for each vector, but it takes ages to compute because I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster centers without having to compute them with the k-means algorithm (the centers it outputs are basically the vector coordinates anyway once the algorithm has run).

Edit to be more specific about my task:

I have tons of vectors (generated from sentences) that I want to cluster. Imagine I have two columns of sentences, and I always want to match a column B sentence to a column A sentence, never column A sentences to each other. That's why I want to set the column A vectors as cluster centers and afterwards predict which column B vectors are closest to these centers. I hope that makes sense.

I am using sklearn's KMeans at the moment.

Upvotes: 3

Views: 4859

Answers (1)

53RT

Reputation: 810

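I think I know what you want to do: you want to manually select the centroids for k-means from some known examples and then run the clustering so that each data point is assigned to the closest of your predefined centroids.

The parameter you are looking for is init, the k-means initialization (see the scikit-learn KMeans documentation). Besides the strings 'k-means++' and 'random', it also accepts an array of shape (n_clusters, n_features) holding the initial centers.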

I have prepared a small example that would do exactly this.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix

# 5 datapoints with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]

X = np.array(data)

distance_matrix(X,X)

The pairwise distance matrix shows which examples are closest to each other.

> array([[0.        , 0.2       , 1.41421356, 1.3453624 , 0.1       ],
>       [0.2       , 0.        , 1.42828569, 1.36014705, 0.2236068 ],
>       [1.41421356, 1.42828569, 0.        , 0.1       , 1.3453624 ],
>       [1.3453624 , 1.36014705, 0.1       , 0.        , 1.28062485],
>       [0.1       , 0.2236068 , 1.3453624 , 1.28062485, 0.        ]])

You can select certain data points to be used as your initial centroids:

centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
                 # [0. 0. 1.]]

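# init=centroids starts k-means from the chosen points; n_init=1 because a fixed start needs only one run
# max_iter=1 limits the fit to a single k-means iteration, so the centroids are updated at most once
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1)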

kmeans.fit(X)
kmeans.labels_

>>> array([0, 0, 1, 1, 0], dtype=int32)

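As you can see, k-means labels the data points as expected. Note that even with max_iter=1 the fitted kmeans.cluster_centers_ are the cluster means after one update step, so they can differ slightly from the centroids you passed in. Omit the max_iter parameter if you want the centroids to be updated until convergence.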
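
If the centers should stay exactly where you put them (the column A vectors from the question), one option is to skip the fit entirely and just look up the nearest center for each column B vector. The following is only a minimal sketch of that idea, not something the original example does: a_vectors and b_vectors are made-up names and values, and it assumes Euclidean distance, which is the default metric of sklearn.metrics.pairwise_distances_argmin.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Hypothetical stand-ins: column A vectors act as fixed centers,
# column B vectors are the ones to be assigned to them.
a_vectors = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])
b_vectors = np.array([[1.0, 0.2, 0.0],
                      [0.0, 0.0, 0.9],
                      [1.0, 0.0, 0.1]])

# For each row of b_vectors, the index of the closest row of a_vectors
# (Euclidean distance by default). No centers are recomputed.
nearest_a = pairwise_distances_argmin(b_vectors, a_vectors)
print(nearest_a)  # [0 1 0]
```

Since nothing is fitted here, the cost is a single distance computation between the two sets, which should scale much better to thousands of sentence vectors than running k-means with one cluster per vector.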

Upvotes: 6
