Reputation: 355
Here is a simple implementation of k-means clustering (with the points in each cluster labelled from 1 to 500):
from pylab import plot, show
import matplotlib.pyplot as plt
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq
# data generation: two overlapping groups of 150 random 2-D points each
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2)))
# computing K-Means with K = 2 (2 clusters)
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# ignore this, just labelling each point in cluster
# NOTE: `labels` was not defined in the original snippet; numbering the
# rows of `data` from 1 is an assumption
labels = range(1, len(data) + 1)
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        str(label),
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
I am trying to find the indices for all of the points within each cluster.
Upvotes: 0
Views: 1702
Reputation: 74154
In this line:
idx,_ = vq(data,centroids)
you have already generated a vector containing the index of the nearest centroid for each point (row) in your data array.
It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1, etc. You can use np.nonzero to find the indices where idx == i, where i is the centroid you are interested in.
For example:
in_0 = np.nonzero(idx == 0)[0]
in_1 = np.nonzero(idx == 1)[0]
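If you want the indices for every cluster at once rather than one centroid at a time, the same idea can be wrapped in a loop. This is only a minimal sketch: the small hand-written idx array stands in for the output of vq, and the name members is made up for illustration.

```python
import numpy as np

# Stand-in for the `idx` vector returned by vq(): one centroid index per data row.
# (Hand-written here so the example runs without the clustering step.)
idx = np.array([0, 1, 1, 0, 1, 0])

# Map each centroid index to the row indices of the points assigned to it.
# `members` is a hypothetical name, not something from scipy.
members = {int(i): np.nonzero(idx == i)[0] for i in np.unique(idx)}

for i, rows in members.items():
    print(i, rows)
```

Using np.unique(idx) instead of a fixed range means only centroids that actually received points get a key.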
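In the comments you also asked why the idx vector differs across runs. This is because if you pass an integer as the second parameter to kmeans, the initial centroids are chosen at random from your observations (see the scipy.cluster.vq.kmeans documentation), so the cluster numbering and the final assignments can change from run to run.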
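If you need the same idx on every run, one option is to pass an array of initial centroids as the second parameter instead of an integer, which removes the random step. A rough sketch, assuming the same two-group data layout as in the question; picking rows 0 and 150 as the starting centroids is an arbitrary choice made for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Reproducible stand-in for the question's data: two groups of 150 points.
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

# Explicit initial centroids (a k-by-N array): one row taken from each group.
# The choice of rows 0 and 150 is an assumption, any sensible guess would do.
init = data[[0, 150]]

# With an explicit initial guess there is no random initialisation,
# so repeated runs produce the same centroids and assignments.
centroids_a, _ = kmeans(data, init)
centroids_b, _ = kmeans(data, init)
idx_a, _ = vq(data, centroids_a)
idx_b, _ = vq(data, centroids_b)

print(np.array_equal(idx_a, idx_b))
```

The result still depends on the initial guess you supply, so a poor guess can give a poor clustering; it is just no longer random.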
Upvotes: 2
Reputation: 77454
You already have that...
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or')
Guess what idx does, and what data[idx==0] vs. data[idx==1] contain.
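To spell it out, idx==0 is a boolean mask over the rows of data, so it can select the points directly or be turned into row indices. A small sketch with made-up values standing in for data and idx:

```python
import numpy as np

# Made-up stand-ins for the question's `data` and the `idx` returned by vq().
data = np.array([[0.1, 0.2], [0.9, 0.8], [1.0, 1.1], [0.0, 0.3]])
idx = np.array([0, 1, 1, 0])

mask = idx == 0                   # True for rows assigned to centroid 0
points_in_0 = data[mask]          # the points themselves
rows_in_0 = np.flatnonzero(mask)  # their row indices in `data`

print(rows_in_0)
print(points_in_0)
```

Indexing data with rows_in_0 gives back the same points as indexing with the mask.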
Upvotes: 1