Emily Erdos

Reputation: 7

Comparing Kmeans and Agglomerative Clustering

I'm trying to compare k-means and agglomerative clustering for a country-level analysis, but I'm struggling because the two methods number their clusters differently. Even if a country (let's say Canada) is in a cluster by itself under both methods, that cluster is '1' in one version and '2' in the other. How can I recognize that both methods grouped the countries the same way and only labeled the clusters differently?

I tried some hacky things with averaging the two, but am struggling to figure the logic out.

   geo_group      a_cluster k_cluster
   <chr>              <int>     <int>
 1 United States          1         1
 2 Canada                 2         3
 3 United Kingdom         3         5
 4 Australia              4         5
 5 Germany                4         5
 6 Mexico                 5         6
 7 France                 5         5
 8 Sweden                 6         8
 9 Brazil                 6         6
10 Netherlands            6         6

Upvotes: 0

Views: 1584

Answers (1)

danlooo

Reputation: 10627

Agglomerative clustering and kmeans are different methods to define a partition of a set of samples (e.g. samples 1 and 2 belong to cluster A and sample 3 belongs to cluster B).
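Because a partition is defined by which samples end up together and not by the cluster numbers, two labelings can be checked for equality after renaming. Here is a minimal sketch in Python (the question's output looks like R, so treat this as an illustration of the idea only). The helper names `canonical_labels` and `same_partition` are made up for this example, and both label lists are assumed to be in the same sample order.

```python
def canonical_labels(labels):
    """Relabel clusters as 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    # setdefault stores the next free number the first time a label is seen.
    return [mapping.setdefault(label, len(mapping)) for label in labels]


def same_partition(labels_a, labels_b):
    """True if both labelings group the samples identically."""
    return canonical_labels(labels_a) == canonical_labels(labels_b)


# Cluster columns copied from the table in the question.
a_cluster = [1, 2, 3, 4, 4, 5, 5, 6, 6, 6]
k_cluster = [1, 3, 5, 5, 5, 6, 5, 8, 6, 6]

# Same grouping under different names compares as equal.
print(same_partition([1, 1, 2], ["B", "B", "A"]))
# The two columns from the question are genuinely different groupings.
print(same_partition(a_cluster, k_cluster))
```

For the table in the question this check reports a real difference: for example, United Kingdom shares its k-means cluster with Australia, Germany and France, but sits alone in the agglomerative result.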

k-means assigns each sample to its nearest cluster centroid, measured by Euclidean distance, and in doing so minimizes the within-cluster sum of squared distances. This is only possible for numerical features and is easiest to interpret for spatial data (e.g. planar coordinates), because there Euclidean distance is just the distance as the crow flies.
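In the usual (Lloyd) formulation of k-means, the Euclidean distances that drive the result are those from each sample to each cluster centroid. The following Python sketch shows one assignment step and one centroid update on invented 2-D points. The function names and the data are assumptions for illustration, not the internals of any particular library.

```python
import math


def assign_to_nearest(points, centroids):
    """Index of the closest centroid (Euclidean distance) for every point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            updated.append(old)
            continue
        updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated


# Invented toy data: two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

assignments = assign_to_nearest(points, centroids)
centroids = update_centroids(points, assignments, centroids)
print(assignments, centroids)
```

Which cluster ends up as index 0 or 1 depends only on the order of the starting centroids, which is one reason the cluster numbers carry no meaning.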

Agglomerative clustering, however, can be used with many other dissimilarity measures, not just metric distances. For example, the Jaccard dissimilarity makes it possible to cluster categorical data as well as numerical data.
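To make the Jaccard example concrete: for two sets of categorical attributes, the Jaccard dissimilarity is one minus the size of the intersection divided by the size of the union. A short Python sketch follows; the function name and the attribute sets are invented for illustration.

```python
def jaccard_dissimilarity(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two collections of categorical values."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical here (a convention, not a rule).
        return 0.0
    return 1 - len(a & b) / len(union)


# Invented categorical samples.
sample_1 = {"red", "round", "small"}
sample_2 = {"red", "square", "small"}
sample_3 = {"blue", "flat"}

print(jaccard_dissimilarity(sample_1, sample_2))  # share 2 of 4 distinct values
print(jaccard_dissimilarity(sample_1, sample_3))  # share nothing
```

A matrix of such pairwise values is the kind of input a hierarchical method can work from, whereas k-means needs coordinates to average.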

Furthermore, the number of clusters can be chosen afterwards, whereas in k-means the chosen k affects the clustering from the start. In agglomerative clustering, clusters are merged step by step in a hierarchical manner. The merge criterion can be single, complete, or average linkage (among others), so there is not just one agglomerative algorithm but many.
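The linkage variants differ only in how the distance between two clusters is derived from the pairwise distances of their members: the minimum (single), the maximum (complete), or the mean (average). A small Python sketch with invented points shows that the same two clusters can be "close" or "far" depending on that choice; `linkage_distance` is a name made up for this example.

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters under a given linkage criterion."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Invented toy clusters on a line.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]

for method in ("single", "complete", "average"):
    print(method, linkage_distance(left, right, method))
```

Since the algorithm always merges the pair of clusters with the smallest such distance, changing the linkage can change the merge order and therefore the final clusters.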

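It is entirely normal to get different results from these methods. The cluster numbers themselves are arbitrary labels, so only the groupings can be compared between methods, not the numbers.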
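To quantify how much two clusterings agree regardless of how the clusters are numbered, one common option is to cross-tabulate the two label columns and compute the adjusted Rand index (1 means identical groupings, values near 0 mean agreement at chance level). Below is a minimal pure-Python sketch using the two columns from the question; the function names are made up here, and in R the cross-tabulation itself should be available through `table()` on the two columns.

```python
from collections import Counter
from math import comb


def contingency(labels_a, labels_b):
    """Count how many samples fall into each (a-label, b-label) combination."""
    return Counter(zip(labels_a, labels_b))


def adjusted_rand_index(labels_a, labels_b):
    """Label-invariant agreement between two partitions of the same samples."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of samples that share a cluster in both, in a only, in b only.
    pairs_both = sum(comb(c, 2) for c in contingency(labels_a, labels_b).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Only happens when both partitions are the same trivial partition.
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Cluster columns copied from the table in the question.
a_cluster = [1, 2, 3, 4, 4, 5, 5, 6, 6, 6]
k_cluster = [1, 3, 5, 5, 5, 6, 5, 8, 6, 6]

for (a, k), count in sorted(contingency(a_cluster, k_cluster).items()):
    print(f"a={a} k={k}: {count}")
print(adjusted_rand_index(a_cluster, k_cluster))
```

In the cross-tabulation, an agglomerative cluster that maps onto exactly one k-means cluster (and vice versa) is the same group under a different number, as with Canada (a=2, k=3). Rows that split across several columns show where the two methods really disagree.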

Upvotes: 1
