taga

Reputation: 3885

What is the number of data points in each K-Means cluster?

I have written code that picks the best number of clusters based on the maximum value of silhouette_score. Now I want to find out how many values each cluster has. For example, if the optimal number of clusters is 3, I want to know that the first cluster has 1241 values, the second 3134 values, and the third 351 values. Is it possible to do something like that?

import pandas as pd
import matplotlib.pyplot as plt
import re 
from sklearn.preprocessing import scale

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.cluster import KMeans, MiniBatchKMeans, AffinityPropagation

from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics.cluster import adjusted_mutual_info_score

from sklearn.decomposition import PCA

df = pd.read_csv('CNN Comments.csv')
df = df.head(8000)
#print(df)
x = df['Text Data']

cv = TfidfVectorizer(analyzer = 'word',max_features = 10000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
#cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')  

x = cv.fit_transform(x)

my_list = []
list_of_clusters = []
for i in range(2,5):

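    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)

    # fit_predict fits the model and returns the labels in one step,
    # so a separate kmeans.fit(x) call is not needed
    cluster_labels = kmeans.fit_predict(x)
    my_list.append(kmeans.inertia_)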

    silhouette_avg = silhouette_score(x, cluster_labels) * 100
    print(round(silhouette_avg,2))
    list_of_clusters.append(round(silhouette_avg, 1))


plt.plot(range(2,5),my_list)
plt.show()


number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters)+2

print('Number of clusters: ', number_of_clusters)

Upvotes: 0

Views: 1292

Answers (2)

PV8

Reputation: 6260

An alternative with NumPy:

import numpy as np
...
unique, counts = np.unique(kmeans.fit_predict(x), return_counts=True)
print(dict(zip(unique, counts)))
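
For non-negative integer labels such as the ones K-Means produces, np.bincount is another option. The sketch below is only an illustration: the labels array is made up and stands in for the labels of an already fitted model (for example kmeans.labels_), which avoids fitting a second time.

```python
import numpy as np

# Hypothetical label array, standing in for kmeans.labels_ of a fitted model.
labels = np.array([0, 2, 1, 1, 0, 1, 2, 1])

# counts[i] is the number of points assigned to cluster i
counts = np.bincount(labels)

# Convert to a plain dict of Python ints for readable printing
sizes = {cluster: int(n) for cluster, n in enumerate(counts)}
print(sizes)  # {0: 2, 1: 4, 2: 2}
```

Unlike np.unique, np.bincount also reports a count of 0 for any label below the maximum that never occurs.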

Upvotes: 0

James

Reputation: 36608

You can use the array assigned to cluster_labels to get the distribution of cluster assignments. I would recommend using Counter from the collections module.

from collections import Counter

...

cluster_labels = kmeans.fit_predict(x)
cluster_counts = Counter(cluster_labels)
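
One thing to watch for: after the loop in the question, kmeans and cluster_labels belong to the last value of i tried, which is not necessarily the best one. A possible way around that is to keep the labels for every candidate and count only the winner. This is a minimal sketch under assumptions: the synthetic make_blobs data replaces the TF-IDF matrix, and the names labels_by_k, scores_by_k and best_k are mine, not from the question.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the TF-IDF matrix: three well-separated blobs
# of 120, 50 and 30 points (hypothetical data, not the CNN comments).
X, _ = make_blobs(
    n_samples=[120, 50, 30],
    centers=np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]),
    cluster_std=0.5,
    random_state=42,
)

labels_by_k = {}  # cluster labels for each candidate number of clusters
scores_by_k = {}  # mean silhouette score for each candidate
for k in range(2, 5):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels_by_k[k] = model.fit_predict(X)
    scores_by_k[k] = silhouette_score(X, labels_by_k[k])

# Pick the candidate with the highest silhouette score, then count its labels
best_k = max(scores_by_k, key=scores_by_k.get)
cluster_counts = Counter(labels_by_k[best_k].tolist())

for label, size in sorted(cluster_counts.items()):
    print(f"cluster {label}: {size} points")
```

The cluster numbering is arbitrary, so only the sizes are meaningful, not which label each size ends up under.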

Upvotes: 3
