Reputation: 770
I need to cluster 500K+ strings based on their similarity.
I have calculated their pair-wise Levenshtein Distances and made a sparse similarity matrix. This matrix contains binary similarities: values for small distances are set to 1.0 and others are 0.0.
I don't know what kind of clustering is good for me. I don't know the number of clusters in advance but it may be considerably large because the similarity matrix is very sparse (about 0.1% values are non-zero).
Upvotes: 0
Views: 524
Reputation: 412
have you considered doing something like https://en.wikipedia.org/wiki/Soundex ? the advantage in such algorithms is that similar words have the same canonical form. For example, both "Robert" and "Rupert" return the same string "R163". Then your clustering boils down to a map like:
clusters = { canonical_form: [list of similar words] }
Naturally, you can tweak the Soundex rules according to your domain.
Upvotes: 2