Rstudent

Reputation: 885

Hierarchical clustering and k-means

I want to run a hierarchical cluster analysis. I am aware of the hclust() function, but not how to use it in practice; I'm stuck on supplying the data to the function and on processing the output.

The main issue is that I would like to cluster on a given measurement.

I would also like to compare the hierarchical clustering with that produced by kmeans(). Again I am not sure how to call this function or use/manipulate the output from it.

My data are similar to:

df <- structure(list(id = c(111, 111, 111, 112, 112, 112),
                     se = c(1, 2, 3, 1, 2, 3),
                     t1 = c(1, 2, 1, 1, 1, 3),
                     t2 = c(1, 2, 2, 1, 1, 4),
                     t3 = c(1, 0, 0, 0, 2, 1),
                     t4 = c(2, 5, 7, 7, 1, 2),
                     t5 = c(1, 0, 1, 1, 1, 1),
                     t6 = c(1, 1, 1, 1, 1, 1),
                     t7 = c(1, 1, 1, 1, 1, 1),
                     t8 = c(0, 0, 0, 0, 0, 0)),
                row.names = c(NA, 6L), class = "data.frame")

I would like to run the hierarchical cluster analysis to identify the optimum number of clusters.

How can I run the clustering based on a predefined measurement, for example to cluster on measurement number 2?

Upvotes: 1

Views: 538

Answers (1)

Duck

Reputation: 39613

For hierarchical clustering there is one essential element you have to define: the method for computing the distance between data points. Clustering is an exploratory technique, so you also have to choose the number of clusters based on how the data points are distributed. I will show how below, comparing three linkage methods on your data df with the function hclust():

The first method is average linkage, which uses the mean of all pairwise distances between the points of two clusters. We will omit the first variable, as it is an id:

#Method 1: average linkage (dist() defaults to Euclidean distance)
hc.average <- hclust(dist(df[,-1]), method = 'average')
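Note that dist() uses Euclidean distance by default. If you want to see exactly what hclust() receives, you can inspect the distance matrix directly:

#Pairwise Euclidean distances between the six rows
d <- dist(df[,-1])
round(as.matrix(d), 2)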

The second method is complete linkage, which uses the largest of all pairwise distances between the points of two clusters:

#Method 2: complete linkage
hc.complete <- hclust(dist(df[,-1]), method = 'complete')

The third method is single linkage, which uses the smallest of all pairwise distances between the points of two clusters:

#Method 3: single linkage
hc.single <- hclust(dist(df[,-1]), method = 'single')
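Before plotting, a quick way to compare the three trees is to look at the merge heights each model stores in its height component:

#Heights of the five merges for each linkage method
rbind(average = hc.average$height,
      complete = hc.complete$height,
      single = hc.single$height)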

With all three models fitted we can analyze the groups.

We can define the number of clusters from the height at which we cut the hierarchical tree: at the largest height the whole dataset forms a single cluster, so it is standard to choose an intermediate height.

With the average method, a cut height of about 2.5 produces four groups and a height of around 4.5 produces two groups:

plot(hc.average, xlab='')

Output:

[dendrogram of hc.average]
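You can confirm these counts numerically by cutting the tree at those heights with cutree():

#Group sizes at the two cut heights of the average-linkage tree
table(cutree(hc.average, h = 2.5))   #four groups
table(cutree(hc.average, h = 4.5))   #two groups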

With the complete method the results are similar, but the height scale has changed.

plot(hc.complete, xlab='')

Output:

[dendrogram of hc.complete]

Finally, the single method produces a different grouping scheme. There are three groups, and for any intermediate choice of height you will keep getting that number of clusters:

plot(hc.single, xlab='')

Output:

[dendrogram of hc.single]
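A quick sketch to see this stability: count the groups over a range of intermediate cut heights and note that the answer stays at three.

#Number of clusters at several intermediate heights of the single-linkage tree
sapply(c(2.8, 3, 3.5), function(h) length(unique(cutree(hc.single, h = h))))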

You can use whichever method you prefer to assign the clusters with the cutree() function, where you pass the model object and the number of clusters. One way to judge clustering performance is to check how homogeneous the groups are, which ultimately depends on the researcher's criteria (a quick check is sketched after the output below). Next, we add the cluster to your data. I will choose the last model and three groups:

#Add cluster
df$Cluster <- cutree(hc.single,k = 3)

Output:

   id se t1 t2 t3 t4 t5 t6 t7 t8 Cluster
1 111  1  1  1  1  2  1  1  1  0       1
2 111  2  2  2  0  5  0  1  1  0       2
3 111  3  1  2  0  7  1  1  1  0       2
4 112  1  1  1  0  7  1  1  1  0       2
5 112  2  1  1  2  1  1  1  1  0       1
6 112  3  3  4  1  2  1  1  1  0       3
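As the informal homogeneity check mentioned above, you can summarise each measurement by cluster; similar values within a cluster suggest homogeneous groups:

#Per-cluster means of the measurement columns t1..t8
aggregate(df[, grep("^t", names(df))], by = list(Cluster = df$Cluster), FUN = mean)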

The function cutree() also has an argument h where you can set the cut height we discussed earlier, instead of the number of clusters k.
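For example, with the single-linkage tree a cut at height 3 falls in the wide gap of the dendrogram and reproduces the same three clusters as k = 3:

#Cut by height instead of by number of clusters
cutree(hc.single, h = 3)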

Regarding your question about using one measurement to drive the clustering: you could scale your data excluding the desired variable, so that variable keeps a different (larger) scale and gains more influence over the distances, and hence over the clustering results.
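A minimal sketch of that idea, assuming your "measurement number 2" is column t2: standardise every other measurement so that t2 keeps its raw scale and therefore dominates the Euclidean distances.

#Emphasise t2 by scaling all other measurement columns
vars <- df[, grep("^t", names(df))]   #measurement columns t1..t8
keep <- sapply(vars, sd) > 0          #scale() gives NaN for constant columns (t6-t8)
vars[keep] <- scale(vars[keep])
vars$t2 <- df$t2                      #restore t2 to its original, unscaled values
hc.t2 <- hclust(dist(vars), method = 'single')
cutree(hc.t2, k = 3)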

Upvotes: 1
