user1375640

Reputation: 151

extract groups from distance matrix using cutree in R

I started with a list of hobbies and people, and I wanted to cluster those people by their common hobbies. So I created a distance matrix, applied hierarchical clustering, and used cutree to cut the tree into a specific number of clusters. Now I have the cutree matrix, but I do not know how to extract the clusters from it. Would you please advise?

Here is an example of what I mean.

The distance matrix:

       one    three   two
one     0      1.0    1.0
three   1      0.0    0.5
two     1      0.5    0.0

Then I ran hclust and cutree and got this result:

hc <- hclust(dist, method="ward")
ct <- cutree(hc, k=1:3)
        1       2      3
one     1       1      1
three   1       2      2
two     1       2      3

How do I get a list of people that belong in the same cluster?

Thank you for your help.

Upvotes: 1

Views: 3339

Answers (2)

AdamO

Reputation: 4920

Your k=1:3 makes cutree return the cluster assignment of every observation for each of $k \in \{1, 2, 3\}$, one column per value of k. To bundle the observations by cluster, pick the column for the number of clusters you are interested in (3 in the example below); you simply need to group the row names of the matrix by the entries of that column.

Example:

hc <- hclust(dist(USArrests))
memb <- cutree(hc, k = 1:5)
tapply(names(memb[, 3]), memb[, 3], c) ## say we're interested in 3 clusters
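
If only one number of clusters is of interest, a minimal sketch (assuming the built-in USArrests data and hclust's default complete linkage, as in the example above; the names memb3 and groups are just illustrative) is to cut at a single k, so that cutree returns a named vector, and then group the names:

```r
# Sketch: cut at a single k so cutree() returns a named integer vector
hc <- hclust(dist(USArrests))          # complete linkage is the default
memb3 <- cutree(hc, k = 3)             # names are the row names (states)

# One character vector of state names per cluster
groups <- split(names(memb3), memb3)

lengths(groups)                        # cluster sizes
table(memb3)                           # same sizes, computed from the memberships
```

groups[["2"]] would then list the members of cluster 2.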

Upvotes: 1

Gavin Simpson

Reputation: 174778

ct is a matrix, so you can index its columns to get the membership for the solutions with 1, 2 and 3 groups. For example,

ct[, 2]

gives the non-trivial solution of assigning 3 observations to 2 groups.

To get the observations in each cluster, then using your data:

Dij <- matrix(c(0, 1.0, 1.0,
                1, 0.0, 0.5,
                1, 0.5, 0.0), ncol = 3, byrow = TRUE)
rownames(Dij) <- colnames(Dij) <- c("one", "two", "three")
hc <- hclust(as.dist(Dij), method="ward")
ct <- cutree(hc, k=1:3)

you can use the split() function to split the row names of ct (which are your observation/sample identifiers from the distance matrix, Dij), breaking them up by the membership vector from whichever column of ct you want to use. E.g.

> split(rownames(ct), ct[,2])
$`1`
[1] "one"

$`2`
[1] "two"   "three"
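
If you want the groups for every k at once, the same idea can be applied to each column of ct. This is only a sketch under a couple of assumptions: it uses method = "ward.D", the name that R 3.1.0 and later use for the method formerly called "ward", and all_groups and the "k" prefixes are illustrative names, not anything cutree produces itself.

```r
# Rebuild the example distance matrix from above
Dij <- matrix(c(0, 1.0, 1.0,
                1, 0.0, 0.5,
                1, 0.5, 0.0), ncol = 3, byrow = TRUE)
rownames(Dij) <- colnames(Dij) <- c("one", "two", "three")

# "ward.D" is the current name of the old "ward" method (assumes R >= 3.1.0)
hc <- hclust(as.dist(Dij), method = "ward.D")
ct <- cutree(hc, k = 1:3)              # columns are named "1", "2", "3"

# For each k, split the observation names by that column's memberships
all_groups <- lapply(colnames(ct), function(k) split(rownames(ct), ct[, k]))
names(all_groups) <- paste0("k", colnames(ct))

all_groups$k2                          # the two-cluster solution
```

Each element of all_groups is itself a list with one character vector per cluster, so all_groups$k2 holds the same grouping as the split() call above.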

Upvotes: 2
