Nilani Algiriyage
Nilani Algiriyage

Reputation: 35776

Python get clustered data-Hierachical Clustering

I used following python script to do a hierarchical clustering and print the dendogram. Please consider I'm new to Data-mining.

import numpy as np
import distance
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[ 0. , 1. , 3.  ,0. ,2.  ,3.  ,1.],
 [ 1. , 0. , 3. , 1.,  1. , 2. , 2.],
 [ 3.,  3. , 0.,  3. , 3.,  3. , 4.],
 [ 0. , 1. , 3.,  0. , 2. , 3.,  1.],
 [ 2. , 1.,  3. , 2.,  0. , 1.,  3.],
 [ 3. , 2.,  3. , 3. , 1. , 0. , 3.],
 [ 1. , 2.,  4. , 1. , 3.,  3. , 0.]])

linkage_matrix = linkage(mat, "single")

dendrogram(linkage_matrix,
           color_threshold=1,
           truncate_mode='lastp',
           distance_sort='ascending')

plt.show()

Following is the dendogram that I have got. I need to print the clusters and the data that belongs to each cluster?

cluster1 4,5
cluster2 ..,..
cluster3 ..,..

enter image description here

Upvotes: 1

Views: 2257

Answers (1)

Sudeep Juvekar
Sudeep Juvekar

Reputation: 5108

According to scipy.cluster.hierarchy documentation

(by running linkage...) A 4 by (n-1) matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

That means, we can iterate over linkage_matrix and find the actual nodes combined to form new cluster. Here is a small for loop to do that

n = len(mat)
cluster_dict = dict()
for i in range(0, 6):
    new_cluster_id = n+i
    old_cluster_id_0 = linkage_matrix[i, 0]
    old_cluster_id_1 = linkage_matrix[i, 1]
    combined_ids = list()
    if old_cluster_id_0 in cluster_dict:
        combined_ids += cluster_dict[old_cluster_id_0]
        del cluster_dict[old_cluster_id_0]
    else:
        combined_ids += [old_cluster_id_0]
    if old_cluster_id_1 in cluster_dict:
        combined_ids += cluster_dict[old_cluster_id_1]
        del cluster_dict[old_cluster_id_1]
    else:
        combined_ids += [old_cluster_id_1]
    cluster_dict[new_cluster_id] = combined_ids
    print cluster_dict

The code creates a dictionary of cluster id to the nodes contained in that cluster. In every iteration, it combines two nodes in linakge_matrix[i, 0] and linkage_matrix[i, 1] into a new node. Finally, it prints the running clusters in each iteration. The output is

{7: [0.0, 3.0]}
{8: [4.0, 5.0], 7: [0.0, 3.0]}
{8: [4.0, 5.0], 9: [6.0, 0.0, 3.0]}
{8: [4.0, 5.0], 10: [1.0, 6.0, 0.0, 3.0]}
{11: [4.0, 5.0, 1.0, 6.0, 0.0, 3.0]}
{12: [2.0, 4.0, 5.0, 1.0, 6.0, 0.0, 3.0]}

(The cluster ids start with 7, original rows are indexed 0 to 6). Note that the ids not in dictionary form their own cluster. For example, row 2.0 forms its own cluster in iteration 2. You can get final set of clusters by stopping early depending on your stopping criteria.

I can stop at iteration 3, (3rd row of output) and then my clusters will be:

cluster 8: {4, 5}
cluster 9: {6, 0, 3}
cluster 1: {1}
cluster 2: {2}

Upvotes: 3

Related Questions