Reputation: 775
I have a large dictionary of pairwise similarity matrices that would look like the following:
similarity['group1']:

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.09      , 0.09      , 0.        ],
       [0.        , 0.09      , 1.        , 0.94535157, 0.        ],
       [0.        , 0.09      , 0.94535157, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])
In short, every element of the previous matrix is the probability that record_i and record_j are similar (values between 0 and 1 inclusive), 1 being exactly similar and 0 being completely different.
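Since affinity='precomputed' expects a square, symmetric matrix whose diagonal holds maximal self-similarity, a quick sanity check can catch malformed input before clustering. check_similarity_matrix below is a hypothetical helper, not part of the question's code:

```python
import numpy as np

def check_similarity_matrix(sim):
    """Validate a pairwise similarity matrix before clustering (illustrative helper)."""
    sim = np.asarray(sim)
    assert sim.ndim == 2 and sim.shape[0] == sim.shape[1], "matrix must be square"
    assert np.allclose(sim, sim.T), "matrix must be symmetric"
    assert np.allclose(np.diag(sim), 1.0), "self-similarity should be 1"
    assert sim.min() >= 0 and sim.max() <= 1, "values must lie in [0, 1]"
    return True

sim = np.array([[1. , 0.  , 0.        , 0.        , 0. ],
                [0. , 1.  , 0.09      , 0.09      , 0. ],
                [0. , 0.09, 1.        , 0.94535157, 0. ],
                [0. , 0.09, 0.94535157, 1.       , 0. ],
                [0. , 0.  , 0.        , 0.        , 1. ]])
print(check_similarity_matrix(sim))  # True
```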
I then feed each similarity matrix into an AffinityPropagation
algorithm in order to group / cluster similar records:
sim = similarities['group1']

clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=????)  # ISSUE here

affinity = clusterer.fit(sim)
cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_
However, since I run the above on multiple similarity matrices, I need a generalised preference parameter which I can't seem to tune.
The docs say it defaults to the median of the similarity matrix, but with that setup I get lots of false positives; the mean sometimes works and sometimes gives too many clusters, etc.
e.g.: when playing with the preference parameter, these are the results I get from the similarity matrix.

preference = default (the median of the similarity matrix, here 0.2) gives incorrect results; we see that record 18 shouldn't be there, because its similarity with the other records is very low:

# Indexes of the elements in Cluster n°5: [15, 18, 22, 27]
{'15_18': 0.08,
 '15_22': 0.964546229533378,
 '15_27': 0.6909703138051403,
 '18_22': 0.12,  # Not OK, the similarity is too low
 '18_27': 0.19,  # Not OK, the similarity is too low
 '22_27': 0.6909703138051403}
preference = 0.2 (in fact any value from 0.11 to 0.26) gives correct results, as the records are similar:

# Indexes of the elements in Cluster n°5: [15, 22, 27]
{'15_22': 0.964546229533378,
 '15_27': 0.6909703138051403,
 '22_27': 0.6909703138051403}
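The pair-score dictionaries shown above can be rebuilt from a cluster's member indexes and the similarity matrix. pair_scores below is a hypothetical helper, not part of the original post, shown on a made-up 3×3 matrix:

```python
import numpy as np

def pair_scores(similarity, member_indexes):
    """Map 'i_j' -> similarity[i, j] for every pair within one cluster (illustrative)."""
    scores = {}
    for a, i in enumerate(member_indexes):
        for j in member_indexes[a + 1:]:
            scores[f'{i}_{j}'] = float(similarity[i, j])
    return scores

sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
print(pair_scores(sim, [0, 1, 2]))  # {'0_1': 0.9, '0_2': 0.1, '1_2': 0.2}
```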
My question is: how should I choose this preference parameter in a way that generalises?
Upvotes: 3
Views: 1721
Reputation: 775
A naive, brute-force grid search solution can be implemented as follows: if any connection within a cluster scores less than a certain threshold (0.5, for example), we re-run the clustering with an adjusted value of the preference parameter.

First, a function to test whether a clustering needs tuning, the threshold being 0.5 in this example:
def is_tuning_required(similarity_matrix, rows_of_cluster):
    rows = similarity_matrix[rows_of_cluster]

    for row in rows:
        for col_index in rows_of_cluster:
            score = row[col_index]
            if score > 0.5:
                continue
            return True

    return False
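To illustrate, the check flags a cluster as soon as any within-cluster similarity falls at or below the 0.5 threshold. The 3×3 matrix below is made up for the demo, and the function is repeated so the snippet runs on its own:

```python
import numpy as np

def is_tuning_required(similarity_matrix, rows_of_cluster):
    rows = similarity_matrix[rows_of_cluster]
    for row in rows:
        for col_index in rows_of_cluster:
            score = row[col_index]
            if score > 0.5:
                continue
            return True
    return False

sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

print(is_tuning_required(sim, [0, 1]))  # False: records 0 and 1 are strongly similar
print(is_tuning_required(sim, [0, 2]))  # True: the 0-2 similarity is only 0.1
```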
Build a preference range of values against which the clustering would run:
def get_pref_range(similarity):
    starting_point = np.median(similarity)

    if starting_point == 0:
        starting_point = np.mean(similarity)

    # Let's try to accelerate the pace of values picking
    step = 1.25 if starting_point >= 0.05 else 2

    preference_tuning_range = [starting_point]
    max_val = starting_point
    while max_val < 1:
        max_val *= 1.25 if max_val > 0.1 and step == 2 else step
        preference_tuning_range.append(max_val)

    min_val = starting_point
    if starting_point >= 0.05:
        while min_val > 0.01:
            min_val /= step
            preference_tuning_range.append(min_val)

    return preference_tuning_range
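As a quick sanity check of the range builder (repeated here so the snippet is self-contained; the 3×3 matrix is made up), the returned candidates should bracket the starting point, extending above 1 and down toward 0.01:

```python
import numpy as np

def get_pref_range(similarity):
    starting_point = np.median(similarity)
    if starting_point == 0:
        starting_point = np.mean(similarity)

    # Coarser step (x2) for small starting points, finer (x1.25) otherwise
    step = 1.25 if starting_point >= 0.05 else 2

    preference_tuning_range = [starting_point]
    max_val = starting_point
    while max_val < 1:
        max_val *= 1.25 if max_val > 0.1 and step == 2 else step
        preference_tuning_range.append(max_val)

    min_val = starting_point
    if starting_point >= 0.05:
        while min_val > 0.01:
            min_val /= step
            preference_tuning_range.append(min_val)

    return preference_tuning_range

sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

prefs = get_pref_range(sim)  # starts at the median (0.9 here) and fans out
print(min(prefs), max(prefs))
```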
A normal AffinityPropagation run, with a preference parameter passed:
def run_clustering(similarity, preference):
    clusterer = AffinityPropagation(damping=0.9,
                                    affinity='precomputed',
                                    max_iter=5000,
                                    convergence_iter=2500,
                                    verbose=False,
                                    preference=preference)

    affinity = clusterer.fit(similarity)
    labels = affinity.labels_

    return labels, len(set(labels)), affinity.cluster_centers_indices_
The method we would actually call with a similarity (1 - distance) matrix as an argument:
def run_ideal_clustering(similarity):
    preference_tuning_range = get_pref_range(similarity)

    best_tested_preference = None
    for preference in preference_tuning_range:
        labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)

        needs_tuning = False
        wrong_clusters = 0
        for label_index in range(labels_count):
            cluster_elements_indexes = np.where(labels == label_index)[0]
            tuning_required = is_tuning_required(similarity, cluster_elements_indexes)
            if tuning_required:
                wrong_clusters += 1
                needs_tuning = True

        if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
            best_tested_preference = (preference, wrong_clusters)

        if not needs_tuning:
            return labels, labels_count, cluster_centers_indices

    # The clustering was never fully tuned during the iterations; fall back to the least wrong preference
    return run_clustering(similarity, best_tested_preference[0])
Obviously, this is a brute-force solution which will not be performant on large datasets / similarity matrices.
If a simpler and better solution gets posted, I'll accept it.
Upvotes: 1