M. Yates

Reputation: 17

Finding the size of a specific k-means cluster

I've been having trouble with this for a while and I just cannot seem to find a way to get the number of data points within a specific cluster. Here's what I have so far:

This first chunk outputs the number of data points in each of my 8 clusters:

from sklearn.cluster import KMeans

def CountFrequency(my_list):
    freq = {}
    for item in my_list:
        if item in freq:
            freq[item] += 1
        else:
            freq[item] = 1

    for key, value in freq.items():
        print("% d : % d" % (key, value))

def clusterCounts(df):

    df3 = df.fillna(df.mean())
    array3 = df3[['column1', 'column2', 'column3']].values
    kmeans = KMeans(n_clusters=8, random_state=42) 
    kmeans.fit(array3)
    return CountFrequency(kmeans.labels_) 

Which results in:

 1 :  26625
 6 :  2562
 2 :  9892
 7 :  2165
 3 :  1633
 0 :  3072
 4 :  1228
 5 :  4315
 None

(Not sure why the None is there but that's a minor issue I think)

My next code chunk prints the centroid for each of my 8 clusters:

def clusters(df):

    df3 = df.fillna(df.mean())
    array3 = df3[['column1', 'column2', 'column3']].values
    kmeans = KMeans(n_clusters=8, random_state=42) 
    kmeans.fit(array3)
    kmeans.labels_
    clusters = kmeans.cluster_centers_
    return clusters

Results in:

[[49.2  2.4 48.4]
 [18.9 18.9 62.1]
 [ 0.2  0.4 99.4]
 [ 1.1 98.3  0.6]
 [98.2  1.   0.9]
 [33.3 32.7 34. ]
 [27.   1.2 71.7]
 [ 3.6 51.9 44.5]]

I am trying to find out how many data points are in the cluster with the [33.3 32.7 34. ] centroid. How can I isolate this centroid's cluster to get the number of data points it contains? As a secondary question, do the keys in the first output I posted (the one with the number of data points per cluster) correspond to the order of the centroids above? I hope this is clear, and thank you in advance!

Upvotes: 0

Views: 1461

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77505

Why don't you do a simple

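# kmeans is the fitted KMeans object; note the trailing underscore on cluster_centers_
for i in range(len(kmeans.cluster_centers_)):
  print("Cluster", i)
  print("Center:", kmeans.cluster_centers_[i])
  # the boolean mask sums to the number of points labelled i
  print("Size:", sum(kmeans.labels_ == i))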

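This works because each True in the comparison counts as 1 and each False as 0 when summed. The values in kmeans.labels_ are row indices into kmeans.cluster_centers_, so label i always refers to the i-th centroid.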
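To go the other way, from a printed centroid to its cluster size, here is a minimal sketch. It assumes you pass in kmeans.labels_ and kmeans.cluster_centers_ from one fitted model. The function name cluster_size_for_centroid and the small labels array are made up for illustration; only the centroid values come from the question. Because the printed centroids are rounded, it matches on the nearest centroid rather than on exact equality.

```python
import numpy as np

def cluster_size_for_centroid(labels, centers, target):
    """Return (index, size) of the cluster whose centroid is closest to target."""
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    # Position i of sizes holds the number of points labelled i.
    sizes = np.bincount(labels, minlength=len(centers))
    # Printed centroids are rounded, so pick the nearest one instead of testing equality.
    distances = np.linalg.norm(centers - np.asarray(target, dtype=float), axis=1)
    idx = int(np.argmin(distances))
    return idx, int(sizes[idx])

# Centroids as printed in the question (stand-in for kmeans.cluster_centers_).
centers = np.array([
    [49.2, 2.4, 48.4],
    [18.9, 18.9, 62.1],
    [0.2, 0.4, 99.4],
    [1.1, 98.3, 0.6],
    [98.2, 1.0, 0.9],
    [33.3, 32.7, 34.0],
    [27.0, 1.2, 71.7],
    [3.6, 51.9, 44.5],
])
# Hypothetical labels (stand-in for kmeans.labels_), just to make the sketch runnable.
labels = np.array([5, 1, 5, 0, 5, 7, 1])

idx, size = cluster_size_for_centroid(labels, centers, [33.3, 32.7, 34.0])
print(idx, size)  # 5 3
```

With the real model you would call cluster_size_for_centroid(kmeans.labels_, kmeans.cluster_centers_, [33.3, 32.7, 34.0]). That centroid sits at index 5 in the question's output, so if both functions were run on the same data with the same random_state, the count listed under key 5 (4315) should be its size. The stray None in the question's output most likely comes from printing the return value of CountFrequency, which prints its counts but returns nothing.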

Upvotes: 1
