Alex

Keyword argument "connectivity" in sklearn AgglomerativeClustering does not work as expected

In my Python code, I have a set of objects that I want to cluster based on a given distance matrix. However, some pairs of objects must never end up in the same cluster. The number of clusters is chosen so that the problem is solvable. I want to use the AgglomerativeClustering class from the sklearn library. I first set the distances between incompatible objects to 1, but in some cases this did not prevent them from ending up in the same cluster. I also tried passing a connectivity matrix to AgglomerativeClustering via the keyword "connectivity", with False entries for the incompatible pairs. This does not work in all cases either. Below is a boiled-down example.

If I do not use the connectivity keyword, I get the cluster labels [0 1 0 1 0]. This makes sense because incompatible objects 0 and 1 are not in the same cluster and incompatible objects 2 and 3 are not in the same cluster. However, when I supply the connectivity matrix I get the cluster labels [0 1 0 0 0]. This seems to contradict the connectivity relationships in the connectivity matrix as objects 2 and 3 should not be in the same cluster.

Am I using the argument "connectivity" incorrectly? If so, how can it be used to achieve what I have described above?

import numpy as np
from sklearn.cluster import AgglomerativeClustering

n_clusters = 2

distance_matrix = np.array([
[0.,1.,0.,0.,0.],
[1.,0.,0.,0.,0.], 
[0.,0.,0.,1.,0.], 
[0.,0.,1.,0.,0.], 
[0.,0.,0.,0.,0.]
])

connectivity_matrix = np.array([
[True, False, True, True, True],
[False, True, True, True, True],
[True, True, True, False, True],
[True, True, False, True, True],
[True, True, True, True, True]
])

# Baseline: cluster on the precomputed distances alone.
clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='precomputed',
                                     linkage='average')
cluster_labels_wo_connectivity_matrix = clustering.fit_predict(distance_matrix)

# Same settings, but with the connectivity matrix passed in.
clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='precomputed',
                                     linkage='average', connectivity=connectivity_matrix)
cluster_labels_with_connectivity_matrix = clustering.fit_predict(distance_matrix)

print("Cluster labels without connectivity matrix: ", cluster_labels_wo_connectivity_matrix)
print("Cluster labels with connectivity matrix: ", cluster_labels_with_connectivity_matrix)
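
To make "contradicts the connectivity matrix" concrete, the two labelings reported above can be checked against the intended constraints with a small helper. This is only a sketch: the name `cannot_link_violations` is made up for illustration and is not part of sklearn, and it reads a False entry as "these two objects must not share a cluster", which is my intended meaning and not necessarily how sklearn interprets the matrix.

```python
from itertools import combinations


def cannot_link_violations(labels, connectivity):
    """Return pairs (i, j), i < j, that share a cluster label even though
    connectivity[i][j] is False (read here as "must not be together")."""
    return [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j] and not connectivity[i][j]
    ]


# Connectivity matrix from the example: False marks an incompatible pair.
connectivity = [
    [True, False, True, True, True],
    [False, True, True, True, True],
    [True, True, True, False, True],
    [True, True, False, True, True],
    [True, True, True, True, True],
]

labels_without = [0, 1, 0, 1, 0]  # labels reported without "connectivity"
labels_with = [0, 1, 0, 0, 0]     # labels reported with "connectivity"

print(cannot_link_violations(labels_without, connectivity))  # []
print(cannot_link_violations(labels_with, connectivity))     # [(2, 3)]
```

The helper only uses indexing, so it should accept the NumPy arrays from the example as well as plain lists.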

I tried other "linkage" types, which sometimes helps and sometimes doesn't. I also tried passing a callable that returns the connectivity matrix, as "kneighbors_graph" does in this example: https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering.html This did not make a difference.

The closest related question on Stack Overflow is this one: Scikit-learn Agglomerative Clustering Connectivity Matrix

However, the graph implied by the connectivity matrix is connected in my case.
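
One way to confirm that the graph is connected is to count its connected components with a breadth-first search. The sketch below uses plain Python, and `count_components` is an illustrative helper name, not an sklearn function. It treats every True entry as an undirected edge and assumes the matrix is symmetric, as it is in the example.

```python
from collections import deque


def count_components(connectivity):
    """Count connected components of the graph whose edges are the
    True entries of a square, symmetric connectivity matrix."""
    n = len(connectivity)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        # An unvisited node starts a new component; explore it fully.
        components += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if connectivity[i][j] and not seen[j]:
                    seen[j] = True
                    queue.append(j)
    return components


# Connectivity matrix from the example above.
connectivity = [
    [True, False, True, True, True],
    [False, True, True, True, True],
    [True, True, True, False, True],
    [True, True, False, True, True],
    [True, True, True, True, True],
]

print(count_components(connectivity))  # 1, i.e. the graph is connected
```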
