Clustering (hclust) branch and cut labeling

Question

I am new to clustering. I want to cluster series that are well correlated with one another. Here is an example to start with:

set.seed(0)

r1 <- rnorm(1000)
r2 <- rnorm(1000)
e <- rnorm(1000)

d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1, 
                r.is.almost.r1=r1 + rnorm(1000)/100, 
                r12=r1*0.75 + r2*0.25, 
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2 )

print(round(cor(d), 2))

plot(hclust(dist(t(d)), method="centroid"))

which has the following correlations

                  r1    r2     e    r4 r.is.r1 r.is.almost.r1   r12   r21  r21e r21ee
r1              1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r2             -0.01  1.00  0.02 -0.02   -0.01          -0.01  0.32  0.95  0.94  0.82
e               0.01  0.02  1.00 -0.01    0.01           0.01  0.02  0.02  0.14  0.52
r4             -0.05 -0.02 -0.01  1.00   -0.05          -0.05 -0.05 -0.04 -0.04 -0.03
r.is.r1         1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r.is.almost.r1  1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r12             0.94  0.32  0.02 -0.05    0.94           0.94  1.00  0.59  0.59  0.51
r21             0.30  0.95  0.02 -0.04    0.30           0.30  0.59  1.00  0.99  0.87
r21e            0.29  0.94  0.14 -0.04    0.29           0.29  0.59  0.99  1.00  0.92
r21ee           0.26  0.82  0.52 -0.03    0.26           0.26  0.51  0.87  0.92  1.00

and produces a dendrogram that intuitively seems good, except that negatively correlated series (such as r4 and e) shouldn't be connected. I can live with that, but would love to get rid of it. I don't really understand what the height is, other than that more correlated series join at lower heights and a perfect correlation sits at zero.
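For reference, `dist()` defaults to Euclidean distance, so the height in the plot above is a distance between the series treated as points, which tracks correlation only loosely. A minimal sketch of one alternative, assuming it is acceptable to cluster on `1 - correlation` with average linkage (both choices are assumptions, not part of the original code): the height of a merge is then one minus the average correlation between the two groups being joined, and negatively correlated series end up farther apart than uncorrelated ones.

```r
# Rebuild the example data so this block runs on its own
set.seed(0)
r1 <- rnorm(1000); r2 <- rnorm(1000); e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1,
                r.is.almost.r1=r1 + rnorm(1000)/100,
                r12=r1*0.75 + r2*0.25,
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2)

# Dissimilarity: 0 for perfect positive correlation, 1 for no correlation,
# 2 for perfect negative correlation
cd <- as.dist(1 - cor(d))

# Average linkage (an assumed choice): merge height = 1 - mean pairwise correlation
hc <- hclust(cd, method = "average")
plot(hc, ylab = "1 - correlation")

# Merge heights translated back into average correlations, one per merge
print(round(1 - hc$height, 2))
```

With this dissimilarity, r1 and r.is.r1 still join at height zero, and the y-axis can be read directly as a correlation shortfall.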

My first wish for the plot is to put a measure of the correlation in small lettering at the points where branches join, for example 100% between r1 and r.is.r1. Is this possible?
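One possible way to get such labels, sketched under the same assumptions as a `1 - correlation` dissimilarity with average linkage: walk through `hc$merge`, compute an x position for each join as the midpoint of its two children (which is assumed here to match where the default plot draws the join), and write `1 - height` there with `text()`. The names `leaf_x` and `node_x` are made up for this illustration.

```r
# Rebuild the example data so this block runs on its own
set.seed(0)
r1 <- rnorm(1000); r2 <- rnorm(1000); e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1,
                r.is.almost.r1=r1 + rnorm(1000)/100,
                r12=r1*0.75 + r2*0.25,
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2)

hc <- hclust(as.dist(1 - cor(d)), method = "average")
n <- length(hc$order)

# x position of each leaf, indexed by the original column number
leaf_x <- match(seq_len(n), hc$order)

# x position of each join: midpoint of its two children.
# In hc$merge, negative entries are single series, positive entries
# refer to an earlier row (an earlier join).
node_x <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  xs <- vapply(hc$merge[i, ],
               function(j) if (j < 0) leaf_x[-j] else node_x[j],
               numeric(1))
  node_x[i] <- mean(xs)
}

plot(hc, ylab = "1 - correlation")
# Small labels just above each join, showing the average correlation
text(node_x, hc$height, labels = round(1 - hc$height, 2), pos = 3, cex = 0.6)
```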

My second wish is to cut the tree (e.g., leaving only r4, e, r12, r21ee, r2 PLUS the two clusters, A being r.is.almost.r1, r1, r.is.r1 and B being r21, r21e) and name the end branches. (My real application has hundreds of series, so I will need to cut it.)
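A minimal sketch of such a cut using `cutree()`, again assuming a `1 - correlation` dissimilarity with average linkage. The threshold `h = 0.03` (groups joined only when their average correlation is at least 0.97) is an illustrative value chosen for this example, not a recommendation.

```r
# Rebuild the example data so this block runs on its own
set.seed(0)
r1 <- rnorm(1000); r2 <- rnorm(1000); e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1,
                r.is.almost.r1=r1 + rnorm(1000)/100,
                r12=r1*0.75 + r2*0.25,
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2)

hc <- hclust(as.dist(1 - cor(d)), method = "average")

# Cut at height 0.03: everything that joined below that height shares a group
groups <- cutree(hc, h = 0.03)

# Named integer vector (series name -> group id), shown as a list of members
members <- split(names(groups), groups)
print(members)
```

`cutree()` also accepts `k` (a number of groups) instead of `h`, which may be easier to tune when there are hundreds of series.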

Somehow the plot needs to decide which of the three series in A, or the two series in B, to put at the branch end. One option would be to paste all the names into one string, which works for small trees. Another option would be to figure out which series in the cluster seems most "central" and use just that name. A final option would be for me to designate some series as better names than others (e.g., a preference for naming r1 over r.is.r1 and over r.is.almost.r1).
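The three naming options can be sketched as follows, assuming groups obtained from `cutree()` on a `1 - correlation` tree. "Most central" is interpreted here as the member with the highest mean correlation to its own group, and the `preferred` vector is a hypothetical user-supplied ranking; both are assumptions made for illustration. The final plot re-clusters one representative per group, which approximates a cut tree rather than reproducing it exactly.

```r
# Rebuild the example data so this block runs on its own
set.seed(0)
r1 <- rnorm(1000); r2 <- rnorm(1000); e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1,
                r.is.almost.r1=r1 + rnorm(1000)/100,
                r12=r1*0.75 + r2*0.25,
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2)

cm <- cor(d)
groups <- cutree(hclust(as.dist(1 - cm), method = "average"), h = 0.03)
members <- split(names(groups), groups)

# Option 1: paste all member names into one label
pasted <- vapply(members, paste, character(1), collapse = "+")

# Option 2: the "most central" member, i.e. highest mean correlation
# with the members of its own group
central <- vapply(members, function(m) {
  m[which.max(rowMeans(cm[m, m, drop = FALSE]))]
}, character(1))

# Option 3: a user-supplied preference order (hypothetical), falling back
# to the first member when no preferred name is in the group
preferred <- c("r1", "r21")
pick <- vapply(members, function(m) {
  hit <- intersect(preferred, m)
  if (length(hit) > 0) hit[1] else m[1]
}, character(1))

# Reduced tree: one representative per group, labelled with the pasted names
hc_small <- hclust(as.dist(1 - cm[pick, pick]), method = "average")
plot(hc_small, labels = pasted, ylab = "1 - correlation")
```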

I hope these are two easy questions...

