Reputation: 3885
I have written code that gives me the best number of clusters, based on the maximum value of silhouette_score. Now I want to find out how many values each cluster has. For example, if the optimal number of clusters is 3, I want to know that the first cluster has 1241 values, the second 3134 values, and the third 351 values.
Is it possible to do something like that?
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import scale
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.cluster import KMeans, MiniBatchKMeans, AffinityPropagation
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics.cluster import adjusted_mutual_info_score
from sklearn.decomposition import PCA
df = pd.read_csv('CNN Comments.csv')
df = df.head(8000)
#print(df)
x = df['Text Data']
cv = TfidfVectorizer(analyzer = 'word',max_features = 10000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
#cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
my_list = []
list_of_clusters = []
for i in range(2,5):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    # fit once and keep the labels; fit_predict also sets kmeans.inertia_
    cluster_labels = kmeans.fit_predict(x)
    my_list.append(kmeans.inertia_)
    silhouette_avg = silhouette_score(x, cluster_labels) * 100
    print(round(silhouette_avg,2))
    list_of_clusters.append(round(silhouette_avg, 1))
plt.plot(range(2,5),my_list)
plt.show()
number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters)+2
print('Number of clusters: ', number_of_clusters)
Upvotes: 0
Views: 1292
Reputation: 6260
An alternative with NumPy:
import numpy as np
...
unique, counts = np.unique(kmeans.fit_predict(x), return_counts=True)
print(dict(zip(unique, counts)))
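To show what the np.unique call returns, here is a minimal sketch on a small hand-written label array. The array `labels` is made up for illustration and stands in for the output of kmeans.fit_predict(x). A fitted KMeans also exposes the same assignments as kmeans.labels_, so the model does not have to be fitted a second time just to count.

```python
import numpy as np

# Hypothetical labels standing in for kmeans.labels_ (six samples, three clusters)
labels = np.array([0, 2, 1, 0, 0, 2])

# np.unique returns the sorted distinct labels and how often each one occurs
unique, counts = np.unique(labels, return_counts=True)
sizes = dict(zip(unique.tolist(), counts.tolist()))
print(sizes)

# np.bincount gives the same counts as an array indexed by cluster id
print(np.bincount(labels))
```

Because KMeans labels are consecutive integers starting at 0, np.bincount works here as well; np.unique is the safer choice if the labels might have gaps.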
Upvotes: 0
Reputation: 36608
You can use the array assigned to cluster_labels to get the distribution of cluster assignments. I would recommend using Counter from the collections module.
from collections import Counter
...
cluster_labels = kmeans.fit_predict(x)
cluster_counts = Counter(cluster_labels)
Upvotes: 3
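One thing to watch in the question's code: once the loop has finished, kmeans and cluster_labels hold the result for the last k tried (4 for range(2,5)), which is not necessarily the k with the best silhouette score. A minimal sketch of one way around this is to keep the labels for every k and count only the winning ones. The helper `cluster_sizes_for_best_k` and the dictionaries `labels_by_k` and `scores_by_k` are hypothetical names, and the values below are made up for illustration.

```python
from collections import Counter

def cluster_sizes_for_best_k(labels_by_k, scores_by_k):
    """Return the k with the highest score and the size of each of its clusters."""
    # pick the k whose silhouette score is largest
    best_k = max(scores_by_k, key=scores_by_k.get)
    # count how many samples were assigned to each cluster for that k
    sizes = Counter(int(label) for label in labels_by_k[best_k])
    return best_k, dict(sorted(sizes.items()))

# Made-up example data: labels and silhouette scores for k = 2, 3, 4
labels_by_k = {
    2: [0, 1, 1, 0, 1, 0],
    3: [0, 2, 1, 0, 0, 2],
    4: [0, 3, 1, 0, 2, 3],
}
scores_by_k = {2: 0.41, 3: 0.57, 4: 0.39}

best_k, sizes = cluster_sizes_for_best_k(labels_by_k, scores_by_k)
print(best_k, sizes)
```

In the question's loop this would amount to storing labels_by_k[i] = cluster_labels and scores_by_k[i] = silhouette_avg on each iteration, then calling the helper after the loop.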