Reputation: 23830
I am coding every function of my application myself, so I am not using tools that do everything for you.
I have been looking for a way to decide where to cut my agglomerative hierarchical clustering.
How do I cluster?
I have coded the application in C# (.NET Framework 4.5.2).
So far I am using standard hierarchical clustering, which uses Euclidean distance to calculate the distance between document pairs.
It also uses UPGMA to calculate the distance between clusters and decide which ones to merge.
I also coded the Rand index and F-measure to test how well it does on my manually labeled data set.
However, the problem is knowing when to stop merging clusters.
I am really bad at understanding mathematical equations without a real data example or well-explained pseudocode.
There are mathematical equations everywhere, but no real-life examples.
So I am looking for your answers. For example, many places say that the Bayesian information criterion (BIC) is a good solution, but I can't figure out how to apply it to my software.
I also have other distance and similarity metrics, such as cosine similarity and Sørensen-Dice distance.
There are many questions on Stack Exchange and Stack Overflow about this, but all the answers use tools like MATLAB or R.
Upvotes: 0
Views: 2723
Reputation: 19601
Try to compute some measure of how well each particular clustering fits, for example the sum of distances from cluster centres, or the sum of squared errors. With agglomerative clustering you can record this error after every merge, which gives you one value per number of clusters. You should find that the error decreases as you increase the number of clusters (it is easier to fit with more clusters) and increases as you decrease the number of clusters.
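As a minimal sketch of the sum of squared errors, assuming each document is a plain list of numbers (for example term weights) and a clustering is a list of clusters: the names `centroid` and `sum_squared_error` and the sample points are made up for illustration. It is written in Python only for brevity; it uses no libraries, so it should port to C# almost line for line.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def sum_squared_error(clusters):
    """Sum over all clusters of squared Euclidean distances to the cluster centre."""
    total = 0.0
    for points in clusters:
        centre = centroid(points)
        for p in points:
            total += sum((p[d] - centre[d]) ** 2 for d in range(len(centre)))
    return total


# Illustrative data: the same four 2-D points, cut into two clusters or one.
two_clusters = [[[1.0, 1.0], [1.0, 2.0]], [[8.0, 8.0], [9.0, 8.0]]]
one_cluster = [[[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]]

sse_two = sum_squared_error(two_clusters)  # small: points sit close to their centres
sse_one = sum_squared_error(one_cluster)   # much larger: one centre fits both groups badly
```

Merging the two tight groups into one makes the error jump, which is the kind of change to look for when deciding where to stop.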
Now draw a graph of the error against the number of clusters and look for an "elbow" where the error starts to grow much more quickly as the number of clusters decreases. You could then assume that the minimum number of clusters before the error starts increasing very rapidly is the true number of clusters in the data.
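If you would rather not read the elbow off a graph by eye, one simple heuristic (an assumption here, not the only way to do it) is to pick the number of clusters where the curve bends most sharply: the point where giving up one cluster costs a lot of extra error, but adding one more gains very little. The function name `find_elbow` and the error values below are invented for illustration, and the sketch assumes you have an error value for every consecutive number of clusters.

```python
def find_elbow(error_by_k):
    """Return the k with the sharpest bend in the error curve, or None if fewer than 3 values.

    error_by_k maps a number of clusters k to the error measured for that clustering.
    """
    ks = sorted(error_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_reaching_k = error_by_k[prev_k] - error_by_k[k]   # error saved by going up to k
        gain_beyond_k = error_by_k[k] - error_by_k[next_k]     # error saved by going past k
        bend = gain_reaching_k - gain_beyond_k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up errors: they fall steeply until 4 clusters, then flatten out.
errors = {1: 100.0, 2: 70.0, 3: 45.0, 4: 10.0, 5: 8.0, 6: 7.0}
elbow = find_elbow(errors)
```

With these invented numbers the sharpest bend is at 4 clusters, so you would stop merging once 4 clusters remain. On real data the curve is often smoother, so it is worth checking the suggestion against the graph and against your Rand index or F-measure.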
See, for example, the graph in "Cluster analysis in R: determine the optimal number of clusters", just below the text "We might conclude that 4 clusters would be indicated by this method:"
Upvotes: 2