Python Kmeans Print absolute frequency of words in each cluster

Question

Hello, is there a way to print the absolute frequency of each word in a cluster? My code looks like this:

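from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# documents: a list of strings, one per text
# (originally named `list`, which shadows the built-in type)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

true_k = 4
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")

# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % terms[ind])
    print()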

My results are e.g.:

Top terms per cluster:
Cluster 0:
 house
 roof
 table
 chair
 tv

Cluster 1:
 ...

But I want something like this, with absolute frequencies of each word:

Cluster 0:
 house 65
 roof 45
 table 44
 chair 33
 tv 18

Thank you in advance :)

Siddhant Tandon · Accepted Answer

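I'm not sure why you need TfidfVectorizer on individual words. In any case, after fitting KMeans you can predict the cluster label for each word and store the labels next to the words in a DataFrame. The word frequencies in a given cluster are then df[df.cluster == some_label].words.value_counts(), where some_label is the cluster number you are interested in.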

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

words = ['this','is','a','very','long','text','my','name','is','not','cortana','today','I','will',
'write','a','long','text','I','am','from','planet','earth','this','text','does','not','make',
 'sense']

#tfidf
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(words)

#kmeans
true_k = 4
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
lab = model.predict(X)

# Save the cluster label of each sample in a DataFrame
df = pd.DataFrame({'words': words, 'cluster': lab})

# Absolute word frequencies for cluster 1
print(df[df.cluster == 1].words.value_counts())
