Clustering a long list of words

Question

I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list so that similar words, for example words within a small edit (Levenshtein) distance of each other, appear in the same cluster. For example, "algorithm" and "alogrithm" should have a high chance of appearing in the same cluster.

I am well aware of the classical unsupervised clustering methods in the pattern recognition literature, such as k-means clustering and EM clustering. The problem is that these methods work on points that reside in a vector space, whereas I have strings. As far as my survey so far shows, the question of how to represent strings in a numerical vector space and how to calculate the "means" of string clusters has not been sufficiently answered. A naive approach would be to combine k-means clustering with Levenshtein distance, but the question remains: how do you represent the "means" of strings? There is a weighting scheme called TF-IDF, but it seems to be mostly related to clustering text documents, not single words. There also seem to be some specialized string clustering algorithms, like the one at http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf

My search in this area is still ongoing, but I wanted to get ideas from here as well. What would you recommend in this case? Is anyone aware of any methods for this kind of problem?

Has QUIT--Anony-Mousse · Accepted Answer

Don't look for clustering. This is misleading. Most algorithms will (more or less forcefully) break your data into a predefined number of groups, no matter what. That k-means isn't the right type of algorithm for your problem should be rather obvious, shouldn't it?

Your problem sounds very similar to clustering; the difference is the scale. A clustering algorithm will produce "macro" clusters, e.g. divide your data set into 10 clusters. What you probably want is that much of your data isn't clustered at all, but that near-duplicate strings, which may stem from errors, get merged, right?

Levenshtein distance with a threshold is probably what you need. You can try to accelerate this by using hashing techniques, for example.
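A minimal sketch of the thresholded approach, in Python. The function names and the default threshold of 2 are assumptions chosen for illustration, not something from the question. It compares all pairs, which is quadratic in the number of words, and skips only pairs whose lengths differ by more than the threshold (the edit distance can never be smaller than the length difference). No hashing or indexing is attempted here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # drop a character of a
                current[j - 1] + 1,            # drop a character of b
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def group_near_duplicates(words, max_distance=2):
    """Merge words whose edit distance is <= max_distance.

    Merging is transitive (union-find), so a chain of small edits can
    link words that are farther apart than max_distance.
    """
    words = list(dict.fromkeys(words))  # drop exact duplicates, keep order
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            # Cheap lower bound: the distance is at least the length difference.
            if abs(len(words[i]) - len(words[j])) > max_distance:
                continue
            if levenshtein(words[i], words[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    return list(groups.values())
```

With the word list `["algorithm", "alogrithm", "smith", "smyth", "jones"]` and the default threshold, this groups "algorithm" with "alogrithm" and "smith" with "smyth", and leaves "jones" on its own. Note that plain Levenshtein distance counts the swapped "go"/"og" as two substitutions, so a threshold of 1 would not merge that pair. Most words in a real list would end up as singletons, which matches the point above that most of the data should not be clustered at all.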

Similarly, TF-IDF is the wrong tool. It's used for clustering texts, not strings. TF-IDF is the weight assigned to a single word (string; but it is assumed that this string does not contain spelling errors!) within a larger document. It doesn't work well on short documents, and it won't work at all on single-word strings.
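To make the last point concrete, here is a small sketch that treats each word as its own one-token "document". It uses one common TF-IDF variant (term frequency normalized by document length, idf = ln(N / df)); other variants exist, and the function names are made up for this illustration. Because TF-IDF treats each token as an atomic dimension, a word and its misspelling share no dimension, so their cosine similarity is zero.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Each "document" is a single word, as in the question.
docs = [["algorithm"], ["alogrithm"], ["smith"]]
vectors = tfidf_vectors(docs)
similarity = cosine(vectors[0], vectors[1])  # 0.0: no shared term at all
```

Every vector here has exactly one nonzero entry, so "algorithm" is exactly as far from "alogrithm" as it is from "smith". The weights carry no information about spelling similarity.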
