Tomáš jedno
Tomáš jedno

Reputation: 11

K-means text documents clustering. How calculate intra and inner similarity?

I classify thousands of documents where the vector components are calculated according to the tf-idf. I use the cosine similarity. I did a frequency analysis of words in clusters to check the difference in top words. But I'm not sure how to calculate the similarity numerically in this sort of documents.

I count internal similarity of a cluster as the average of the similarity of each document to the centroid of the cluster. If I counted the average couple is based on small number.

External similarity calculated as the average similarity of all pairs cluster centroid

I count right? It is based on my inner similarity values average ​​from 0.2 (5 clusters and 2000 documents)to 0.35 (20 clusters and 2000 documents). Which is probably caused by a widely-oriented documents in computer science. Intra from 0.3-0.7. The result may be like that? On the Internet I found various ways of measuring, do not know which one to use than the one that was my idea. I am quite desperate.

Thank you so much for your advice!

Upvotes: 1

Views: 1451

Answers (1)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77454

Using k-means with anything but squared euclidean is risky. It may stop converging, as the convergence proof relies on both the mean and the distance assignment optimizing the same criterion. K-means minimizes squared deviations, not distances!

For a k-means variant that can handle arbitrary distance functions (and have guaranteed convergence), you will need to look at k-medoids.

Upvotes: 1

Related Questions