Reputation: 23
I have an n x n matrix with pairwise distances as entries. The matrix looks for example like this:
m = matrix (c(0, 0, 1, 1, 1, 1,0, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0,1, 1, 1, 0, 1, 1,1, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0),ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
Now I want to put every letter in the same cluster if the distance to any other letter is 0. For the example above, I should get three clusters consisting of:
(A,B,E)
(C,F)
(D)
I would be interested in the number of entries in each cluster. At the end, I want to have a vector like:
clustersizes = c(3,2,1)
I assume it is possible by using the hclust function, but I'm not able to extract the three clusters. I also tried the cutree function, but if I don't know the number of clusters before and also not the cutoff for the height, how should I do it?
This is what I tried:
h <- hclust(dist(m),method="single")
plot(h)
Thanks!
Upvotes: 2
Views: 945
Reputation: 101
I ran into the same problem and ekstroem's answer did the trick, albeit that I used base R functions instead of igraph
.
# The data
m = matrix (c(0, 0, 1, 1, 1, 1,0, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0,1, 1, 1, 0, 1, 1,1, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0),ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
m
#> A B C D E F
#> A 0 0 1 1 1 1
#> B 0 0 1 1 0 1
#> C 1 1 0 1 1 0
#> D 1 1 1 0 1 1
#> E 1 0 1 1 0 1
#> F 1 1 0 1 1 0
Your issue with hclust()
and cutree()
probably stem from a typo or a confusion between dist()
and as.dist()
(see below the differences). The former computes an Euclidean distance and the latter consider the matrix already as a distance (which is what I understood you actually wanted).
# Mind the difference between `dist()` and `as.dist()`
dist(m)
#> A B C D E
#> B 1.000000
#> C 2.000000 2.236068
#> D 1.732051 2.000000 1.732051
#> E 1.414214 1.000000 2.000000 1.732051
#> F 2.000000 2.236068 0.000000 1.732051 2.000000
as.dist(m)
#> A B C D E
#> B 0
#> C 1 1
#> D 1 1 1
#> E 1 0 1 1
#> F 1 1 0 1 1
Using as.dist()
and the correct agglomeration method (or linkage, see ?hclust
for details), you can find the expected groups.
# Make sure to specify your linkage method to single
# linkage (aka nearest neighbor, or friend-of-friend)
hc <- hclust(as.dist(m), method = "single")
plot(hc)
More importantly, you can extract the size of each cluster as you wanted.
# Extract the membership to the clusters and summarize
memberships <- cutree(hc, h = 0)
memberships
#> A B C D E F
#> 1 1 2 3 1 2
table(memberships)
#> memberships
#> 1 2 3
#> 3 2 1
Created on 2023-10-20 with reprex v2.0.2
Upvotes: 1
Reputation: 6191
Welcome to SO.
There are several ways to handle this but an easy choice is to use the igraph
package.
First we convert your matrix m
to an adjacency matrix. It contains the distances to neighbouring nodes, where 0 means no connection. Thus, we subtract your matrix from 1 to get that
mm <- 1 - m
diag(mm) <- 0 # We don't allow loops
This gives
> mm
A B C D E F
A 0 1 0 0 0 0
B 1 0 0 0 1 0
C 0 0 0 0 0 1
D 0 0 0 0 0 0
E 0 1 0 0 0 0
F 0 0 1 0 0 0
Then we just need to feed it to igraph
to compute communities
library("igraph")
fastgreedy.community(as.undirected(graph.adjacency(mm)))
which produces
IGRAPH clustering fast greedy, groups: 3, mod: 0.44
+ groups:
$`1`
[1] "A" "B" "E"
$`2`
[1] "C" "F"
$`3`
[1] "D"
Now if you save that result you can get the community sizes right away
res < fastgreedy.community(as.undirected(graph.adjacency(mm)))
sizes(res)
which yields
Community sizes
1 2 3
3 2 1
Upvotes: 2