Reputation: 2876
I am new to clustering. I want to cluster series that are well correlated with one another. Here is an example:
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1,
                r.is.almost.r1=r1 + rnorm(1000)/100,
                r12=r1*0.75 + r2*0.25,
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2)
print(round(cor(d), 2))
plot(hclust(dist(t(d)), method="centroid"))
which has the following correlations
                  r1    r2     e    r4 r.is.r1 r.is.almost.r1   r12   r21  r21e r21ee
r1              1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r2             -0.01  1.00  0.02 -0.02   -0.01          -0.01  0.32  0.95  0.94  0.82
e               0.01  0.02  1.00 -0.01    0.01           0.01  0.02  0.02  0.14  0.52
r4             -0.05 -0.02 -0.01  1.00   -0.05          -0.05 -0.05 -0.04 -0.04 -0.03
r.is.r1         1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r.is.almost.r1  1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r12             0.94  0.32  0.02 -0.05    0.94           0.94  1.00  0.59  0.59  0.51
r21             0.30  0.95  0.02 -0.04    0.30           0.30  0.59  1.00  0.99  0.87
r21e            0.29  0.94  0.14 -0.04    0.29           0.29  0.59  0.99  1.00  0.92
r21ee           0.26  0.82  0.52 -0.03    0.26           0.26  0.51  0.87  0.92  1.00
and a dendrogram which intuitively seems good (except that negative correlations [as in r4 and e] shouldn't be connected, but I can live with it; would love to kill this). I don't really understand what the height is, other than that more correlated series have lower heights and a perfect correlation sits at zero.
My first wish for the plot is to put a measure of the correlation in small lettering on the tree roots, for example 100% between r1 and r.is.r1. Is this possible?
My second wish is to cut the tree (e.g., leaving only r4, e, r12, r21ee, r2, PLUS the two clusters, A being r.is.almost.r1, r1, r.is.r1 and B being r21, r21e) and name the end branches. (My real application has hundreds of series, so I will need to cut it.)
Somehow the plot needs to decide which of the three series in A or the two series in B to put at the branch end. One option would be to paste all three into one string, which works for small trees. Another option would be to figure out which of my series in the cluster seems most "central" and just name that one. A final option would be for me to designate some series as being better names than others (e.g., a preference for naming r1 over r.is.r1 and over r.is.almost.r1).
I hope these are two easy questions...
Upvotes: 0
Views: 106
Reputation: 24139
There should be an easier way, but here is a plot of the heights on the tree.
I am not sure how you want to measure the correlation. Because dist() is Euclidean, a height of 0 means the two series are identical (hence perfectly correlated), but the height can become very large for very distant points. Maybe you could scale it using the range from 0 or min(cluster$height) to max(cluster$height).
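One way to make the heights read directly as correlations is to cluster on a correlation-based distance instead of the Euclidean one. The following is only a sketch of that idea, not the code used above: it assumes 1 - cor as the distance and complete linkage (both are my choices), and it uses a cut-down version of the example data.

```r
# Sketch: cluster on 1 - correlation so that merge heights map back to correlations.
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
d <- data.frame(r1 = r1, r2 = r2, r.is.r1 = r1,
                r21 = r1 * 0.25 + r2 * 0.75)

# 1 - cor maps correlation +1 to distance 0 and correlation -1 to distance 2,
# so negatively correlated series end up far apart rather than connected.
corr.dist <- as.dist(1 - cor(d))

# With complete linkage, each merge height is the largest pairwise distance
# inside the merged cluster, i.e. 1 minus its smallest pairwise correlation.
hc <- hclust(corr.dist, method = "complete")

# Smallest within-cluster correlation at each merge, as a percentage
merge.cor <- round(100 * (1 - hc$height))
print(merge.cor)
```

With this distance, the height no longer depends on the scale of the series, only on how they co-move.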
For part 2, the criteria for cutting or pruning the tree are not clear. Maybe the cutree() function?
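As a rough illustration of how cutree() could cover the cutting and the three naming options from the question, here is a sketch under several assumptions of mine: a 1 - cor distance with complete linkage (cutree() with h needs increasing heights, which the centroid method does not guarantee), an arbitrary cut height of 0.02, a reduced data set, and a hypothetical preference vector pref.

```r
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
d <- data.frame(r1 = r1, r2 = r2, r4 = rnorm(1000), r.is.r1 = r1,
                r.is.almost.r1 = r1 + rnorm(1000) / 100,
                r21 = r1 * 0.25 + r2 * 0.75)
cm <- cor(d)
hc <- hclust(as.dist(1 - cm), method = "complete")

# Cut at height 0.02 (assumed threshold): series stay together only if
# every pairwise correlation within the group is at least 0.98.
groups <- cutree(hc, h = 0.02)
members <- split(names(groups), groups)

# Option 1: paste all member names into one label
pasted <- vapply(members, paste, character(1), collapse = "+")

# Option 2: the most "central" member, taken here as the one with the
# highest mean correlation with the members of its own cluster
central <- vapply(members, function(m) {
  m[which.max(rowMeans(cm[m, m, drop = FALSE]))]
}, character(1))

# Option 3: a user-supplied ranking of preferred names (hypothetical)
pref <- c("r1", "r.is.r1", "r.is.almost.r1")
by.pref <- vapply(members, function(m) {
  hit <- intersect(pref, m)          # keeps the order given in pref
  if (length(hit) > 0) hit[1] else m[1]
}, character(1))

# Pruned tree: re-cluster one representative per group and label the leaves
pruned <- hclust(as.dist(1 - cm[central, central]), method = "complete")
plot(pruned, labels = pasted)
```

Re-clustering the representatives is a shortcut; the pruned tree approximates the upper part of the full tree rather than reproducing it exactly.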
This should provide a start.
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000), r.is.r1=r1,
                r.is.almost.r1=r1 + rnorm(1000)/100,
                r12=r1*0.75 + r2*0.25,
                r21=r1*0.25 + r2*0.75, r21e=r1*0.25 + r2*0.75 + e/10,
                r21ee=r1*0.25 + r2*0.75 + e/2)
print(round(cor(d), 2))
cluster <- hclust(dist(t(d)), method="centroid")
plot(cluster)
# Calculate the x-position of each merge point, used to place the labels
distance <- as.data.frame(cluster$merge)
distance$loc <- NA_real_
for (i in seq_along(cluster$height)) {
  # A negative entry is a single series (a leaf): use its position on the x-axis.
  # A positive entry is an earlier merge: reuse the position computed for it.
  if (distance$V1[i] < 0) {
    d1 <- which(cluster$order == abs(distance$V1[i]))
  } else {
    d1 <- distance$loc[distance$V1[i]]
  }
  if (distance$V2[i] < 0) {
    d2 <- which(cluster$order == abs(distance$V2[i]))
  } else {
    d2 <- distance$loc[distance$V2[i]]
  }
  distance$loc[i] <- mean(c(d1, d2))
}
# Plot the heights on the chart, just below each merge point
for (i in seq_along(cluster$height)) {
  text(distance$loc[i], cluster$height[i] - 0.5,
       format(cluster$height[i], digits = 3))
}
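The same positioning idea can be used to print a percentage correlation at each merge, as asked in the question. This is a sketch only: merge.x() is a hypothetical helper of mine, and the 1 - cor distance with complete linkage is an assumption rather than what the code above uses.

```r
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
d <- data.frame(r1 = r1, r2 = r2, r.is.r1 = r1,
                r21 = r1 * 0.25 + r2 * 0.75)
hc <- hclust(as.dist(1 - cor(d)), method = "complete")

# Hypothetical helper: x-coordinate of every merge in an hclust tree
merge.x <- function(hc) {
  leaf.pos <- order(hc$order)        # leaf.pos[k] = x position of series k
  x <- numeric(nrow(hc$merge))
  for (i in seq_len(nrow(hc$merge))) {
    # Negative entries are leaves, positive entries are earlier merges
    ends <- vapply(hc$merge[i, ], function(j) {
      if (j < 0) leaf.pos[-j] else x[j]
    }, numeric(1))
    x[i] <- mean(ends)
  }
  x
}

plot(hc, ylab = "1 - correlation")
x <- merge.x(hc)
# Label each merge with the smallest correlation inside the merged cluster
text(x, hc$height, paste0(round(100 * (1 - hc$height)), "%"),
     pos = 3, cex = 0.7)
```

Here the identical pair r1 and r.is.r1 merges at height 0, so its label reads 100%.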
Upvotes: 1