yooons

Reputation: 127

I don't understand how the threshold works in fcluster (method='complete')

Xi=[[0,5,10,8,3],[5,0,1,3,2],[10,1,0,5,1],[8,3,5,0,6],[3,2,1,6,0]]

Xi is a distance matrix.

import scipy.cluster.hierarchy as shc
shc.fcluster(shc.linkage(Xi, 'complete'), 9, criterion='distance')

In this code the threshold is 9.

The clustering result is array([3, 1, 1, 2, 1], dtype=int32).

I don't understand why it is not array([2, 1, 1, 1, 1]).

This image shows the result after clustering: https://drive.google.com/file/d/17806FuPuNpJiqhT12jiuFOMGNUvB1vjT/view?usp=sharing

Upvotes: 1

Views: 2044

Answers (1)

Max Pierini

Reputation: 2249

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
import seaborn as sns

You have this distance matrix

Xi = np.array([[0,5,10,8,3],[5,0,1,3,2],[10,1,0,5,1],[8,3,5,0,6],[3,2,1,6,0]])

which we can visualize as a heatmap

df = pd.DataFrame(Xi)
# mask the upper triangle and the zero diagonal, since the matrix is symmetric
mask = np.zeros_like(Xi, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(df, annot=True, fmt='.0f', cmap="YlGnBu", mask=mask);

[Figure: annotated heatmap of the lower triangle of Xi]

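Now we get the pdist. Note that pdist treats each row of Xi as an observation vector and computes the Euclidean distance between rows; linkage does the same when it is given a 2-D array, as in your code.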

p = pdist(Xi)

and the linkage

Z = linkage(p, method='complete')
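A quick check, as a sketch: when linkage receives a 2-D array, SciPy treats its rows as observation vectors and computes the pairwise Euclidean distances itself, so calling it on Xi directly, as in the question, should give the same tree as the pdist route. The names Z_direct, Z_pdist and same_tree are illustrative only.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

Xi = np.array([[0, 5, 10, 8, 3],
               [5, 0, 1, 3, 2],
               [10, 1, 0, 5, 1],
               [8, 3, 5, 0, 6],
               [3, 2, 1, 6, 0]], dtype=float)

# 2-D input: rows are treated as observations. SciPy may warn that Xi
# looks like an uncondensed distance matrix, so silence warnings here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    Z_direct = linkage(Xi, method='complete')

# explicit route: Euclidean distances between rows, then linkage
Z_pdist = linkage(pdist(Xi), method='complete')

same_tree = np.allclose(Z_direct, Z_pdist)
print(same_tree)
```

Both calls cluster the rows of Xi as points in 5-dimensional space; neither uses the entries of Xi as the distances themselves.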

You set 9 as the threshold, so

dendrogram(Z)
plt.axhline(9, color='k', ls='--');

[Figure: dendrogram of Z with a dashed horizontal line at height 9]

you have 3 clusters

fcluster(Z, 9, criterion='distance')

array([3, 1, 1, 2, 1], dtype=int32)
#      0  1  2  3  4   <- elements

and it's correct: you can verify with the dendrogram that

  • elements 1, 2 and 4 are in cluster 1
  • element 3 is in cluster 2
  • element 0 is in cluster 3
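The same check can be done numerically, as a minimal sketch. Column 2 of the linkage matrix holds the height of each merge, and for a monotonic method such as complete linkage, criterion='distance' leaves every merge above the threshold undone, so the number of flat clusters is the number of such merges plus one. The names heights and n_clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

Xi = np.array([[0, 5, 10, 8, 3],
               [5, 0, 1, 3, 2],
               [10, 1, 0, 5, 1],
               [8, 3, 5, 0, 6],
               [3, 2, 1, 6, 0]], dtype=float)

Z = linkage(pdist(Xi), method='complete')

heights = Z[:, 2]  # merge heights, in increasing order
print(np.round(heights, 2))

n_clusters = {}
for t in (9, 12):
    n_above = int(np.sum(heights > t))  # merges not performed at this cut
    labels = fcluster(Z, t, criterion='distance')
    n_clusters[t] = len(np.unique(labels))
    print(t, n_above + 1, labels)
```

Under these assumptions the heights come out at roughly 4.58, 7.28, 10.68 and 15.13. A cut at 9 leaves the last two merges undone (three clusters), and any cut between the last two heights leaves only one undone (two clusters).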

If you want only two clusters, you have to choose a higher threshold, for example 12

dendrogram(Z)
plt.axhline(12, color='k', ls='--');

[Figure: dendrogram of Z with a dashed horizontal line at height 12]

and so you have your expected result

fcluster(Z, 12, criterion='distance')

array([2, 1, 1, 1, 1], dtype=int32)
#      0  1  2  3  4   <- elements
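One caveat, offered as a sketch rather than a replacement for the result above: pdist(Xi) and linkage(Xi, ...) both treat the rows of Xi as coordinates. If Xi is meant to be used directly as a precomputed distance matrix, linkage expects it in condensed form, which scipy.spatial.distance.squareform produces from a symmetric matrix with a zero diagonal. The names condensed, Z_pre and labels_pre are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

Xi = np.array([[0, 5, 10, 8, 3],
               [5, 0, 1, 3, 2],
               [10, 1, 0, 5, 1],
               [8, 3, 5, 0, 6],
               [3, 2, 1, 6, 0]], dtype=float)

# square symmetric matrix -> condensed vector of the 10 pairwise distances
condensed = squareform(Xi)

# complete linkage on the distances exactly as given in Xi
Z_pre = linkage(condensed, method='complete')
print(Z_pre[:, 2])  # merge heights

labels_pre = fcluster(Z_pre, 9, criterion='distance')
print(labels_pre)
```

With the distances used as given, the merge heights are 1, 2, 6 and 10, so a threshold of 9 already separates element 0 from elements 1 to 4, which is the grouping the question expected.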

Upvotes: 3
