Reputation: 768
I am using the Python KMeans clustering algorithm to cluster documents. I have created a document-term matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                             stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
Then I applied KMeans clustering using the following code:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
y = km.fit(X)
My next task is to see the top terms in every cluster. Searching on Google suggested that many people have used km.cluster_centers_.argsort()[:, ::-1] to find the top terms in the clusters, as in the following code:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end='')
print()
Now my question is this: to my understanding, km.cluster_centers_ returns the coordinates of the centers of the clusters. So, for example, if there are 100 features and three clusters, it would return a matrix of 3 rows and 100 columns, each row representing the centroid of one cluster. What I wish to understand is how this is used in the above code to determine the top terms in each cluster. Thanks, any comments are appreciated. Nadeem
Upvotes: 4
Views: 9219
Reputation: 1284
A little late to the game, but I had the same question and couldn't find a satisfactory answer.
Here's what I did:
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# documents you are clustering
docs = ['first document', 'second', 'third doc', 'etc.'] * 10

# init vectorizer
tfidf = TfidfVectorizer()

# fit vectorizer and get vecs
vecs = tfidf.fit_transform(docs)

# fit your kmeans cluster to vecs
# don't worry about the hyperparameters
clusters = MiniBatchKMeans(
    n_clusters=16,
    init_size=1024,
    batch_size=2048,
    random_state=20
).fit_predict(vecs)

# get the list mapping keyword id -> keyword name
labels = tfidf.get_feature_names()
def get_cluster_keywords(vecs, clusters, labels, top_n=10):
    # init a dict where we will count term occurrence per cluster
    cluster_keyword_ids = {cluster_id: {} for cluster_id in set(clusters)}

    # loop through the vector and cluster of each doc
    for vec, cluster_id in zip(vecs, clusters):
        # inspect the non-zero elements of each row of the sparse matrix
        for j in vec.nonzero()[1]:
            # check whether we have seen this keyword in this cluster before
            if j not in cluster_keyword_ids[cluster_id]:
                cluster_keyword_ids[cluster_id][j] = 0
            # add a count
            cluster_keyword_ids[cluster_id][j] += 1

    # cluster_keyword_ids contains ids; we need to map them
    # back to keywords using the labels param
    return {
        cluster_id: [
            labels[keyword_id]  # map from keyword id to keyword
            # sort our keyword id counts and
            # only return the top n per cluster
            for keyword_id, count in sorted(
                keyword_id_counts.items(),
                key=lambda x: x[1],  # sort from highest count to lowest
                reverse=True
            )[:top_n]
        ]
        for cluster_id, keyword_id_counts in cluster_keyword_ids.items()
    }
Then you can run:
>>> get_cluster_keywords(vecs, clusters, labels, top_n=10)
{0: ['document', 'first'], 1: ['second'], 2: ['doc', 'third'], 3: ['etc']}
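Note that get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions you can build labels the same way with get_feature_names_out():

# on scikit-learn >= 1.2, get_feature_names() no longer exists;
# get_feature_names_out() returns the same vocabulary as an array
labels = tfidf.get_feature_names_out()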
Upvotes: 1
Reputation: 6103
You're correct about the shape and meaning of the cluster centers. Because you're using the Tf-Idf vectorizer, your "features" are the words of the vocabulary (and each document is its own vector over those words). Thus, when you cluster the document vectors, each "feature" of a centroid represents the relevance of that word to the cluster: "word" (in the vocabulary) = "feature" (in your vector space) = "column" (in your centroid matrix).
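As a toy illustration of that layout (made-up weights; 3 clusters and a 5-word vocabulary), each row of cluster_centers_ is one centroid:

import numpy as np

# 3 clusters x 5 vocabulary words: entry [i, j] is how strongly
# word j is weighted in the centroid of cluster i (made-up numbers)
centers = np.array([
    [0.1, 0.9, 0.0, 0.3, 0.2],
    [0.7, 0.0, 0.5, 0.1, 0.0],
    [0.0, 0.2, 0.1, 0.8, 0.6],
])
print(centers.shape)  # (3, 5): one row per cluster, one column per word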
The get_feature_names call gets the mapping from column index to the word it represents (so it seems from the documentation; if that doesn't work as expected, just invert the vocabulary_ mapping to get the same result).
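A minimal sketch of that fallback, assuming vectorizer is the fitted TfidfVectorizer from the question:

# vocabulary_ maps word -> column index; invert it so that
# terms[col] is the word represented by that column
terms = [None] * len(vectorizer.vocabulary_)
for word, col in vectorizer.vocabulary_.items():
    terms[col] = word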
Then the .argsort()[:, ::-1] line converts each centroid into a sorted (descending) list of the columns most highly valued in it, and hence the words most relevant to it (since words = columns).
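As a tiny worked example of that line (toy numbers: one centroid, four vocabulary words):

import numpy as np

terms = ['apple', 'banana', 'cherry', 'date']    # hypothetical vocabulary
centroid = np.array([[0.05, 0.60, 0.10, 0.25]])  # made-up tf-idf centroid

order = centroid.argsort()[:, ::-1]  # column indices, highest weight first
print(order)                             # [[1 3 2 0]]
print([terms[i] for i in order[0, :2]])  # top 2 words: ['banana', 'date']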
The rest of the code is just printing, and I'm sure that doesn't need any explaining. All the code really does is sort each centroid in descending order of the features/words most valued in it, then map those columns back to their original words and print them.
Upvotes: 6