Reputation: 53896
The values in this similarity matrix are based on Jaccard's coefficient:
a, b, c
a, 1, .3, .6
b, .3, 1, .9
c, .6, .9, 1
To generate a cluster analysis I used this code:
tb = read.csv("c:\\Users\\Adrian\\Desktop\\sim-matrix.csv", row.names=1);
d = as.dist(tb);
hclust(d);
plot(hclust(d, method="average"));
Which generates this dendrogram:
(dendrogram image not preserved)
Why are a & b grouped close together? How is closeness measured? Does "average" average the corresponding values for a, b & c? ?hclust does not provide any details.
Upvotes: 1
Views: 267
Reputation: 7602
I don't know what d = as.dist(tb); does, but I think hclust(d, method="average") assumes d to be a distance matrix.
Why are a & b grouped close together?
If you provide a similarity matrix, the low similarity of .3 between a and b is interpreted as a low distance, and thus as a high similarity. That would explain why a and b are grouped first.
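A quick way to check this is to look at the merge order hclust reports. The following is only a minimal sketch: it rebuilds the matrix from the question inline instead of reading the CSV, and the names sim, hc_wrong and first_merged are made up for illustration.

```r
# Similarity matrix from the question, rebuilt inline (assumption: same values as the CSV)
labs <- c("a", "b", "c")
sim <- matrix(c(1, .3, .6,
                .3, 1, .9,
                .6, .9, 1),
              nrow = 3, dimnames = list(labs, labs))

# Passing similarities straight to as.dist() makes hclust treat them as distances
hc_wrong <- hclust(as.dist(sim), method = "average")

# In $merge, negative entries are single observations; row 1 is the first merge
first_merged <- labs[sort(abs(hc_wrong$merge[1, ]))]
print(first_merged)
```

If the reasoning above is right, the first merge is a and b, the pair with the lowest similarity.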
How is closeness measured?
Since you provided the similarity matrix, I think you are referring to how the closeness of clusters is measured when using average linkage. Assuming the first point holds, average linkage (in hclust, "average" is UPGMA; WPGMA is "mcquitty") takes the average of the pairwise values between all observations in the two distinct clusters. Let's check that:
Step 1:
Average similarities
a-b: .3
a-c: .6
c-b: .9
So we merge a and b at .3.
Step 2:
Average similarities
ab-c: (.6 + .9) / (2*1) = 1.5 / 2 = .75
So merging ab-c should be at .75. Well, either my calculation is wrong or the dendrogram corresponds to complete linkage.
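The hand calculation can be compared against the merge heights hclust stores. This is a sketch under the same assumption as the steps above, namely that the similarities are passed to as.dist() unchanged; the variable names (sim, manual, avg_h, cmp_h) are illustrative only.

```r
# Similarity matrix from the question, rebuilt inline
labs <- c("a", "b", "c")
sim <- matrix(c(1, .3, .6,
                .3, 1, .9,
                .6, .9, 1),
              nrow = 3, dimnames = list(labs, labs))
d_as_is <- as.dist(sim)  # similarities (mis)used as distances, as in the question

# Step 2 by hand: average of a-c and b-c
manual <- mean(c(sim["a", "c"], sim["b", "c"]))

# Heights at which hclust performs the two merges, for two linkage methods
avg_h <- hclust(d_as_is, method = "average")$height
cmp_h <- hclust(d_as_is, method = "complete")$height

print(manual)
print(avg_h)
print(cmp_h)
```

With average linkage the second merge lands at the hand-computed .75, while complete linkage uses the larger of the two values, .9. Comparing these numbers with the top of the dendrogram shows which linkage a given plot corresponds to.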
Upvotes: 1
Reputation: 18759
The problem is that you never tell your code at any point that this is a similarity index. In fact you specifically say the opposite: as.dist(tb). hclust takes a matrix of distances, i.e. dissimilarities. The simplest way to go for you is:
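# Similarity matrix from the question, labelled so the dendrogram shows a, b, c
tb <- matrix(c(1,.3,.6,.3,1,.9,.6,.9,1), nrow=3, dimnames=list(c("a","b","c"), c("a","b","c")))
tb <- 1-tb # Similarity to dissimilarity
d <- as.dist(tb)
# hclust() defaults to method="complete"; pass method="average" for average linkage
plot(hclust(d))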
Closeness (as you asked) was measured when you measured your Jaccard index.
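To see what the conversion changes, one can inspect the merge order and heights after turning the similarities into dissimilarities. This is a rough sketch, not the only way to do it: it uses 1 - similarity as above, picks method="average" to match the question, and the names sim, hc and first_merged are hypothetical.

```r
# Similarity matrix from the question, rebuilt inline
labs <- c("a", "b", "c")
sim <- matrix(c(1, .3, .6,
                .3, 1, .9,
                .6, .9, 1),
              nrow = 3, dimnames = list(labs, labs))

# Convert similarity to dissimilarity before clustering
d <- as.dist(1 - sim)
hc <- hclust(d, method = "average")

# Row 1 of $merge holds the first pair joined (negative = single observation)
first_merged <- labs[sort(abs(hc$merge[1, ]))]
print(first_merged)
print(hc$height)
```

Now b and c, the most similar pair (.9), are joined first at a distance of about .1, and a joins them at the average of .7 and .4, about .55.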
Upvotes: 0