Reputation: 2224
I want to cluster a set of data, which is as follows:
{[1,2],
[2,3],
[3,2],
[9,8],
[8,10],
[7,9,8],
[7,10,5,9]
...
}
The elements do not have a fixed number of dimensions.
When K = 2, the first 3 elements should be clustered as one group and the other 4 as another group.
I understand the k-means algorithm, but its distance calculation is not suitable for my case. Because the elements vary in dimension, I use Jaccard distance as the distance between every pair of elements.
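A minimal sketch of the Jaccard distance mentioned above, assuming each element is treated as a set of values (so order and duplicates are ignored). The function name `jaccard_distance` and the convention for two empty elements are illustrative choices, not part of the original question.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two variable-length elements, viewed as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        # Assumption: two empty elements are considered identical.
        return 0.0
    # 1 - |intersection| / |union|
    return 1.0 - len(sa & sb) / len(union)


# Elements of different lengths can be compared directly.
print(jaccard_distance([9, 8], [7, 9, 8]))
print(jaccard_distance([1, 2], [9, 8]))
```

Because the distance only looks at shared values, elements with no values in common are always at distance 1, regardless of how many dimensions they have.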
Instead of computing means, one idea is to find the centroids of clusters. A centroid is the point which has the smallest sum of distances to all other points in its cluster.
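The centroid definition above (the cluster member with the smallest sum of distances to the others, often called a medoid) can be sketched as follows. The helper name `medoid` and the inline `jaccard` lambda are assumptions for illustration; the lambda assumes non-empty elements.

```python
def medoid(cluster, dist):
    """Return the member of `cluster` with the smallest total distance to all members."""
    return min(cluster, key=lambda c: sum(dist(c, other) for other in cluster))


# Illustrative Jaccard distance on non-empty elements, viewed as sets.
jaccard = lambda a, b: 1.0 - len(set(a) & set(b)) / len(set(a) | set(b))

first_group = [[1, 2], [2, 3], [3, 2]]
print(medoid(first_group, jaccard))
```

This costs a quadratic number of distance evaluations per cluster, which is usually acceptable for small clusters but worth keeping in mind for speed.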
I am working on a program based on the above idea, implementing k-means++ clustering. I want an algorithm that is stable (the output should not be extremely different on every run), relatively fast, and that uses Jaccard distance.
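One way the idea above could look in code is sketched below: k-means++-style seeding followed by alternating assignment and centroid (medoid) updates. This is only a sketch under assumptions, not a reference implementation: the names `seed_medoids` and `k_medoids` are hypothetical, elements are treated as sets, and run-to-run stability is obtained simply by fixing the random seed.

```python
import random


def jaccard_distance(a, b):
    sa, sb = set(a), set(b)
    union = sa | sb
    return 0.0 if not union else 1.0 - len(sa & sb) / len(union)


def seed_medoids(points, k, rng):
    """k-means++-style seeding: later seeds are drawn with probability
    proportional to the squared distance to the nearest seed so far."""
    medoids = [rng.choice(points)]
    while len(medoids) < k:
        weights = [min(jaccard_distance(p, m) for m in medoids) ** 2 for p in points]
        if sum(weights) == 0:
            break  # every point coincides with an existing seed
        medoids.append(rng.choices(points, weights=weights, k=1)[0])
    return medoids


def k_medoids(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)  # fixed seed: same input gives same output
    medoids = seed_medoids(points, k, rng)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)),
                          key=lambda i: jaccard_distance(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: the new medoid minimises the sum of distances in its cluster.
        new_medoids = [
            min(c, key=lambda x: sum(jaccard_distance(x, o) for o in c)) if c else m
            for c, m in zip(clusters, medoids)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters


points = [[1, 2], [2, 3], [3, 2], [9, 8], [8, 10], [7, 9, 8], [7, 10, 5, 9]]
medoids, clusters = k_medoids(points, k=2, seed=0)
print(medoids)
print(clusters)
```

The result still depends on which seeds are drawn, so a common mitigation is to run with several seeds and keep the clustering with the lowest total distance to the medoids.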
I am here to listen to advice because this is my first time doing data clustering, so maybe I am missing something. Please recommend a suitable algorithm if there is one, or point out my mistakes.
Upvotes: 0
Views: 157
Reputation: 77454
Rather than k-means - which needs a fixed number of continuous valued dimensions to compute means - why don't you use the much more appropriate
which can be used with Jaccard distance!
Upvotes: 1