Reputation: 17
I've been having trouble with this for a while and I just cannot seem to find a way to get the number of data points within a specific cluster. Here's what I have so far:
This first chunk outputs the number of data points in each of my 8 clusters:
from sklearn.cluster import KMeans

def CountFrequency(my_list):
    # Tally how many times each cluster label occurs
    freq = {}
    for item in my_list:
        if item in freq:
            freq[item] += 1
        else:
            freq[item] = 1
    for key, value in freq.items():
        print("% d : % d" % (key, value))

def clusterCounts(df):
    # Fill missing values with column means, then cluster on three columns
    df3 = df.fillna(df.mean())
    array3 = df3[['column1', 'column2', 'column3']].values
    kmeans = KMeans(n_clusters=8, random_state=42)
    kmeans.fit(array3)
    return CountFrequency(kmeans.labels_)
Which results in:
1 : 26625
6 : 2562
2 : 9892
7 : 2165
3 : 1633
0 : 3072
4 : 1228
5 : 4315
None
(Not sure why the None is there, but that's a minor issue I think.)
My next code chunk prints the centroid for each of my 8 clusters:
def clusters(df):
    df3 = df.fillna(df.mean())
    array3 = df3[['column1', 'column2', 'column3']].values
    kmeans = KMeans(n_clusters=8, random_state=42)
    kmeans.fit(array3)
    kmeans.labels_
    clusters = kmeans.cluster_centers_
    return clusters
Results in:
[[49.2 2.4 48.4]
[18.9 18.9 62.1]
[ 0.2 0.4 99.4]
[ 1.1 98.3 0.6]
[98.2 1. 0.9]
[33.3 32.7 34. ]
[27. 1.2 71.7]
[ 3.6 51.9 44.5]]
I am trying to find out how many data points are in the cluster with the [33.3 32.7 34. ] centroid. How can I isolate this centroid's cluster in order to get the number of data points it contains? As a secondary question, do the keys in the first output I posted (the number of data points per cluster) correspond to the order of the centroids above? I hope this is clear, and thank you in advance!
Upvotes: 0
Views: 1461
Reputation: 77505
Why don't you do a simple
for i in range(len(kmeans.cluster_centers_)):
    print("Cluster", i)
    print("Center:", kmeans.cluster_centers_[i])
    print("Size:", sum(kmeans.labels_ == i))
This works because kmeans.labels_ == i is a boolean array, and summing it counts each True as 1 and each False as 0.
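On the secondary question: as far as I know, the values in kmeans.labels_ are row indices into kmeans.cluster_centers_, so key 5 in your counts should belong to the sixth row of the centroid array. Here is a minimal sketch that uses only the numbers you printed. The helper names count_frequency and nearest_center_index are mine, not part of scikit-learn, and the lookup assumes both of your functions produced the same fit, which they should given the same data and random_state=42.

```python
import math
from collections import Counter

def count_frequency(labels):
    """Return {cluster_label: number_of_points} instead of printing it."""
    return dict(Counter(labels))

def nearest_center_index(centers, target):
    """Index of the row in `centers` closest (Euclidean) to `target`."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], target))

# Centroids exactly as printed in the question; row i is assumed to be cluster i.
centers = [
    [49.2, 2.4, 48.4],
    [18.9, 18.9, 62.1],
    [0.2, 0.4, 99.4],
    [1.1, 98.3, 0.6],
    [98.2, 1.0, 0.9],
    [33.3, 32.7, 34.0],
    [27.0, 1.2, 71.7],
    [3.6, 51.9, 44.5],
]

# Cluster sizes exactly as printed in the question (label: count).
sizes = {1: 26625, 6: 2562, 2: 9892, 7: 2165, 3: 1633, 0: 3072, 4: 1228, 5: 4315}

idx = nearest_center_index(centers, [33.3, 32.7, 34.0])
print("Cluster", idx, "has", sizes[idx], "points")  # Cluster 5 has 4315 points

# Toy label list (made up) to show the dict-returning counter.
print(count_frequency([0, 1, 1, 2]))  # {0: 1, 1: 2, 2: 1}
```

With a fitted model you would pass kmeans.cluster_centers_ and count_frequency(kmeans.labels_) instead of the hard-coded values. Returning the dict rather than printing inside the function should also get rid of the stray None: CountFrequency has no return statement, so clusterCounts hands back None, and that is presumably what gets printed last.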
Upvotes: 1