Hussein Hazimeh
Hussein Hazimeh

Reputation: 9

Hierarchical agglomerative clustering

Can we use Hierarchical agglomerative clustering for clustering data in this format ?

"beirut,proff,email1"
"beirut,proff,email2"
"swiss,aproff,email1"
"france,instrc,email2"
"swiss,instrc,email2"
"beirut,proff,email1"
"swiss,instrc,email2"
"france,aproff,email2"

If not, what is the compatible clustering algorithm to cluster data with string values ?

Thank you for your help!

Upvotes: 0

Views: 895

Answers (2)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77474

Hierarchical clustering is a rather flexible clustering algorithm. Except for some linkages (Ward?) it does not have any requirement on the "distance" - it could be a similarity as well, usually negative values will work just as well, you don't need triangle inequality etc.

Other algorithms - such as k-means - are much more limited. K-means minimizes variance; so it can only handle (squared) Euclidean distance; and it needs to be able to compute means, thus the data needs to be in a continuous, fixed dimensionality vector space; and sparsity may be an issue.

One algorithm that probably is even more flexible is Generalized DBSCAN. Essentially, it needs a binary decision "x is a neighbor of y" (e.g. distance less than epsilon), and a predicate to measure "core point" (e.g. density). You can come up with arbitary complex such predicates, that may no longer be a single "distance" anymore.

Either way: If you can measure similarity of these records, hiearchical clustering should work. The question is, if you can get enough similarity out of that data, and not just 3 bit: "has the same email", "has the same name", "has the same location" -- 3 bit will not provide a very interesting hierarchy.

Upvotes: 0

Sneftel
Sneftel

Reputation: 41484

Any type of clustering requires a distance metric. If all you're willing to do with your strings is treat them as equal to each other or not equal to each other, the best you can really do is the field-wise Hamming distance... that is, the distance between "abc,def,ghi" and "uvw,xyz,ghi" is 2, and the distance between "abw,dez,ghi" is also 2. If you want to cluster similar strings within a particular field -- say clustering "Slovakia" and "Slovenia" because of the name similarity, or "Poland" and "Ukraine" because they border each other, you'll use more complex metrics. Given a distance metric, hierarchical agglomerative clustering should work fine.

All this assumes, however, that clustering is what you actually want to do. Your dataset seems like sort of an odd use-case for clustering.

Upvotes: 0

Related Questions