K-means text document clustering: how to calculate intra- and inter-cluster similarity?

Question

I am clustering thousands of documents whose vector components are tf-idf weights, and I use cosine similarity as the measure. I did a frequency analysis of the words in each cluster to check that the top words differ, but I'm not sure how to measure similarity numerically for this kind of document.

I calculate the internal similarity of a cluster as the average similarity of each document to the centroid of that cluster. If I instead average over pairs of documents, the result is a very small number.

I calculate the external similarity as the average similarity over all pairs of cluster centroids.
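To make the two definitions above concrete, here is a minimal sketch, assuming NumPy, a dense tf-idf matrix and integer cluster labels. The names `l2_normalize` and `cluster_similarities` and the toy data are made up for illustration only.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Leave all-zero rows unchanged instead of dividing by zero.
    return X / np.where(norms == 0, 1.0, norms)

def cluster_similarities(X, labels):
    """Return (internal, external) cosine similarities for a clustering.

    internal: dict mapping cluster id -> mean cosine similarity of the
              cluster's documents to the cluster centroid
    external: mean cosine similarity over all pairs of distinct centroids
    """
    X = l2_normalize(np.asarray(X, dtype=float))
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)

    # Centroid = mean of the unit-length document vectors, re-normalized.
    centroids = np.vstack([X[labels == k].mean(axis=0) for k in cluster_ids])
    centroids = l2_normalize(centroids)

    internal = {
        int(k): float((X[labels == k] @ centroids[i]).mean())
        for i, k in enumerate(cluster_ids)
    }

    # Upper triangle (k=1) picks each centroid pair once, without the diagonal.
    sims = centroids @ centroids.T
    upper = np.triu_indices(len(cluster_ids), k=1)
    external = float(sims[upper].mean())
    return internal, external

# Toy data (invented): four documents over three terms, two clusters.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
labels = np.array([0, 0, 1, 1])

internal, external = cluster_similarities(X, labels)
print(internal, external)
```

With only one cluster there are no centroid pairs, so the external value in this sketch would be NaN. For a large sparse tf-idf matrix the same idea applies, but the dense conversion here would need to be replaced.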

Am I calculating this correctly? My internal similarity values average from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably because the documents cover a wide range of computer science topics. The external similarity ranges from 0.3 to 0.7. Are results like these plausible? I have found various ways of measuring this on the Internet, and I don't know which one to use other than the one I came up with myself. I am quite desperate.

Thank you so much for your advice!

Answers (1)
