Reputation: 675
I was trying to develop a clustering algorithm tasked with finding k classes on a set of 2D points, (with k given as input) using use the Kruskal algorithm lightly modified to find k spanning trees instead of one.
I compared my output to a proposed optimum (1) using the rand index, which for k = 7 resulted on 95.5%. The comparison can be seen on the link below.
Problem:
The set have 5 clearly spaced clusters that are easily classified by the algorithm, but the results are rather disappointing for k > 5, which is when things start to get tricky. I believe that my algorithm is correct, and maybe the data is particularly bad for a Kruskal approach. Single Linkage Agglomerative Clustering, such as Kruskal's, are known to perform badly at some problems since it reduces the assessment of cluster quality to a single similarity between a pair of points.
The idea of the algorithm is very simple:
Bottomline: Why is the algorithm failing like that? Is it Kruskal's fault? If so, why precisely? Any suggestions to improve the results without abandoning Kruskal?
(1): Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data(TKDD),2007.1(1):p.1-30.
Upvotes: 6
Views: 2128
Reputation: 77454
This is known as single-link effect.
Kruskal seems to be a semi-clever way of computing single-linkage clustering. The naive approach for "hierarchical clustering" is O(n^3)
, and the Kruskal approach should be O(n^2 log n)
due to having to sort the n^2
edges.
Note that SLINK can do single-linkage clustering in O(n^2)
runtime and O(n)
memory.
Have you tried loading your data set e.g. into ELKI, and compare your result to single-link clustering.
To get bette results, try other linkages (usually in O(n^3)
runtime) or density-based clustering such as DBSCAN (in O(n^2)
without index, and O(n log n)
with index). On this toy data set, epsilon=2
and minPts=5
should work good.
Upvotes: 5
Reputation: 19601
The bridges between clusters that should be different are a classic example of Kruskal getting things wrong. You might try, for each point, overwriting the shortest distance from that point with the second shortest distance from that point - this might increase the lengths in the bridges without increasing other lengths.
By eye, this looks like something K-means might do well - except for the top left, the clusters are nearly circular.
Upvotes: 2
Reputation: 12592
You can try the manhattan distance but to get better you can try a classic line and circle detection algorithm.
Upvotes: 0