ramailo sathi

Reputation: 355

Finding the indices of all points corresponding to a particular centroid using kmeans clustering

Here is a simple implementation of k-means clustering (each point in the plot is labelled with its number):

from pylab import plot, show
import matplotlib.pyplot as plt
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation: two overlapping blobs of 150 points each
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# ignore this, just labelling each point in cluster
# (assumption: `labels` was never defined in the snippet, so number the rows from 1)
labels = range(1, len(data) + 1)
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        str(label),
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()

I am trying to find the indices of all of the points within each cluster.

Upvotes: 0

Views: 1702

Answers (2)

ali_m

Reputation: 74154

In this line:

idx,_ = vq(data,centroids)

you have already generated a vector containing the index of the nearest centroid for each point (row) in your data array.

It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1, etc. You can use np.nonzero to find the indices where idx == i, with i being the number of the centroid you are interested in.

For example:

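import numpy as np

in_0 = np.nonzero(idx == 0)[0]  # row indices of the points nearest to centroid 0
in_1 = np.nonzero(idx == 1)[0]  # row indices of the points nearest to centroid 1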
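If you have more than two clusters, the same idea can be wrapped in a loop. Here is a minimal sketch, assuming the same kind of data as in the question; the name members and the fixed seed are my own additions for illustration, and np.flatnonzero(mask) is just shorthand for np.nonzero(mask)[0] on a 1-D array.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Fixed seed only so the generated data is repeatable (an assumption of this sketch).
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

centroids, _ = kmeans(data, 2)
idx, _ = vq(data, centroids)

# Map each centroid number to the row indices of the points assigned to it.
members = {i: np.flatnonzero(idx == i) for i in range(len(centroids))}

for i, rows in members.items():
    print(f"centroid {i}: {len(rows)} points, first rows {rows[:5]}")
```

Every row of data ends up in exactly one entry of members, so the lengths add up to len(data).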

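In the comments you also asked why the idx vector differs across runs. This is because, if you pass an integer as the second parameter to kmeans, the initial centroid locations are chosen randomly (this is described in the documentation for scipy.cluster.vq.kmeans), so the numbering of the clusters can change from one run to the next.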
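If you need the same assignment on every run, one option is to pass an array of starting centroids instead of an integer, since kmeans also accepts an initial guess as its second parameter. The sketch below is only an illustration: the values in initial_guess and the helper assign are hypothetical and chosen by eye for this particular data.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Fixed seed only so the generated data is repeatable (an assumption of this sketch).
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

# Hypothetical starting guess: one row per centroid.
initial_guess = np.array([[1.0, 1.0], [0.5, 0.5]])

def assign(points, guess):
    """Run kmeans from an explicit starting guess and return the assignment vector."""
    centroids, _ = kmeans(points, guess.copy())
    idx, _ = vq(points, centroids)
    return idx

first = assign(data, initial_guess)
second = assign(data, initial_guess)
print(np.array_equal(first, second))
```

With an explicit guess there is no random initialization, so both calls should yield the same idx, and centroid 0 stays the one that started near (1, 1).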

Upvotes: 2

Has QUIT--Anony-Mousse

Reputation: 77454

You already have that...

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')

Guess what idx does, and what data[idx==0] vs. data[idx==1] contain.
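To spell the hint out on a toy example (the arrays below are made up for illustration, not taken from the question): idx == 0 is a boolean mask with one entry per row, and indexing with it selects the same rows as indexing with the integer positions where the mask is True.

```python
import numpy as np

idx = np.array([0, 1, 1, 0, 1])        # toy assignment vector, shaped like vq's output
data = np.arange(10.0).reshape(5, 2)   # toy data, one row per point

mask = idx == 0               # boolean mask, True where the point belongs to centroid 0
rows = np.flatnonzero(mask)   # integer row indices where the mask is True

print(rows)
print(np.array_equal(data[mask], data[rows]))
```

So if you want the points themselves, the mask is enough; if you want their row numbers, convert the mask to indices.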

Upvotes: 1
