Sonne
Sonne

Reputation: 23

Cluster groups based on pairwise distances

I have an n x n matrix with pairwise distances as entries. The matrix looks for example like this:

m = matrix (c(0, 0, 1, 1, 1, 1,0, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0,1, 1, 1, 0, 1, 1,1, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0),ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")

Now I want to put every letter in the same cluster if the distance to any other letter is 0. For the example above, I should get three clusters consisting of:

(A,B,E)

(C,F)

(D)

I would be interested in the number of entries in each cluster. At the end, I want to have a vector like:

clustersizes = c(3,2,1)

I assume it is possible by using the hclust function, but I'm not able to extract the three clusters. I also tried the cutree function, but if I don't know the number of clusters before and also not the cutoff for the height, how should I do it?

This is what I tried:

h <- hclust(dist(m),method="single")
plot(h) 

Thanks!

Upvotes: 2

Views: 945

Answers (2)

cpauvert
cpauvert

Reputation: 101

I ran into the same problem and ekstroem's answer did the trick, albeit that I used base R functions instead of igraph.

# The data
m = matrix (c(0, 0, 1, 1, 1, 1,0, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0,1, 1, 1, 0, 1, 1,1, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0),ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
m
#>   A B C D E F
#> A 0 0 1 1 1 1
#> B 0 0 1 1 0 1
#> C 1 1 0 1 1 0
#> D 1 1 1 0 1 1
#> E 1 0 1 1 0 1
#> F 1 1 0 1 1 0

Your issue with hclust() and cutree() probably stem from a typo or a confusion between dist() and as.dist() (see below the differences). The former computes an Euclidean distance and the latter consider the matrix already as a distance (which is what I understood you actually wanted).

# Mind the difference between `dist()` and `as.dist()`
dist(m)
#>          A        B        C        D        E
#> B 1.000000                                    
#> C 2.000000 2.236068                           
#> D 1.732051 2.000000 1.732051                  
#> E 1.414214 1.000000 2.000000 1.732051         
#> F 2.000000 2.236068 0.000000 1.732051 2.000000
as.dist(m)
#>   A B C D E
#> B 0        
#> C 1 1      
#> D 1 1 1    
#> E 1 0 1 1  
#> F 1 1 0 1 1

Using as.dist() and the correct agglomeration method (or linkage, see ?hclust for details), you can find the expected groups.

# Make sure to specify your linkage method to single
# linkage (aka nearest neighbor, or friend-of-friend)
hc <- hclust(as.dist(m), method = "single")
plot(hc)

More importantly, you can extract the size of each cluster as you wanted.

# Extract the membership to the clusters and summarize
memberships <- cutree(hc, h = 0)
memberships
#> A B C D E F 
#> 1 1 2 3 1 2
table(memberships)
#> memberships
#> 1 2 3 
#> 3 2 1

Created on 2023-10-20 with reprex v2.0.2

Upvotes: 1

ekstroem
ekstroem

Reputation: 6191

Welcome to SO.

There are several ways to handle this but an easy choice is to use the igraph package.

First we convert your matrix m to an adjacency matrix. It contains the distances to neighbouring nodes, where 0 means no connection. Thus, we subtract your matrix from 1 to get that

mm <- 1 - m  
diag(mm) <- 0 # We don't allow loops

This gives

> mm
  A B C D E F
A 0 1 0 0 0 0
B 1 0 0 0 1 0
C 0 0 0 0 0 1
D 0 0 0 0 0 0
E 0 1 0 0 0 0
F 0 0 1 0 0 0

Then we just need to feed it to igraph to compute communities

library("igraph")
fastgreedy.community(as.undirected(graph.adjacency(mm)))

which produces

IGRAPH clustering fast greedy, groups: 3, mod: 0.44
+ groups:
  $`1`
  [1] "A" "B" "E"

  $`2`
  [1] "C" "F"

  $`3`
  [1] "D"

Now if you save that result you can get the community sizes right away

res < fastgreedy.community(as.undirected(graph.adjacency(mm)))
sizes(res)

which yields

Community sizes
1 2 3 
3 2 1 

Upvotes: 2

Related Questions