Reputation: 27
Hello, is there a way to print the absolute frequency of each word in a cluster? My code looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # documents: list of input strings
true_k = 4
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
# Sort each centroid's feature indices by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % terms[ind])
    print()
My results look like this, for example:
Top terms per cluster:
Cluster 0:
house
roof
table
chair
tv
Cluster 1:
...
But I want something like this, with absolute frequencies of each word:
Cluster 0:
house 65
roof 45
table 44
chair 33
tv 18
Thank you in advance :)
Upvotes: 0
Views: 542
Reputation: 701
I'm not sure why you need TfidfVectorizer on individual words. In any case, after fitting KMeans you can predict the cluster label for each word and then check the word frequencies within a cluster with df[df.cluster == some_label].words.value_counts().
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
words = ['this','is','a','very','long','text','my','name','is','not','cortana','today','I','will',
         'write','a','long','text','I','am','from','planet','earth','this','text','does','not','make',
         'sense']
#tfidf
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(words)
#kmeans
true_k = 4
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
lab = model.predict(X)
#save cluster labels for each sample in a dataframe
df = pd.DataFrame({'words':words, 'cluster':lab})
#check word frequencies for cluster==1
print(df[df.cluster == 1].words.value_counts())
Upvotes: 2