neo33

Reputation: 1879

How to make sense of KMeans clusters?

Hello, I am using KMeans to build a topic classifier. My idea is to take Facebook comments from different users so that I have several documents.

My list of documents looks as follows:

list=["comment1","comment2",...,"commentN"]

Then I used tf-idf to vectorize every comment and assigned each comment to a specific cluster. My program is the following:

from sklearn.cluster import KMeans

# tfidf_vectorizer and tf_vectorizer are created earlier (not shown here)
tfidf = tfidf_vectorizer.fit_transform(list)
tf = tf_vectorizer.fit_transform(list)
print("size of tf", tf.shape)
print("size of tfidf", tfidf.shape)
# Creating clusters from the data (fitted on the raw term counts, tf)
kmeans = KMeans(n_clusters=8, random_state=0).fit(tf)
print("printing labels", kmeans.labels_)
# Printing the distinct cluster labels
print("Number of clusters", set(kmeans.labels_))
print("dimensions of matrix labels", kmeans.labels_.shape)
# Predicting the cluster of each document (here, the same data used for fitting)
y_pred = kmeans.predict(tf)
print("dimensions of predict matrix", y_pred.shape)

My output looks as follows:

size of tf (202450, 2000)
size of tfidf (202450, 2000)
printing labels [1 1 1 ..., 1 1 1]
Number of clusters {0, 1, 2, 3, 4, 5, 6, 7}
dimensions of matrix labels (202450,)
dimensions of predict matrix (202450,)
C:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int64 was converted to float64.
  warnings.warn(msg, DataConversionWarning)

Now the problem is that I would like to find a way to make sense of these clusters, for example that class 0 is about sports and class 1 is about politics. I would appreciate any recommendation for understanding these clusters, or at least a way to get all the comments that belong to a specific cluster so that I can interpret the result. Thanks for the support.

Upvotes: 0

Views: 69

Answers (1)

Rachid Ait Abdesselam

Reputation: 385

There are multiple approaches.

The easiest approach is to look at the centroid of each cluster: its highest-weighted terms are a good summary of the words most used in that cluster.

The second approach is to take the sub-matrix of tf-idf rows for the elements assigned to each cluster. You can then run PCA (principal component analysis) on that sub-matrix to extract factors and better understand the composition of each cluster.

Sorry, I do not use scikit-learn, so I cannot help you with code.

Hope that helps.

Upvotes: 1
