Reputation: 11067
I have a number of different clustering processes that each splits a set into a number of labelled partitions.
e.g. Given the set {1,2,3,4,5,6}
, I might collect labellings such as:
A = [1,1,2,2,3,3]
B = [3,3,1,1,2,2]
C = [1,1,1,1,2,2]
D = [1,2,3,3,4,4]
Here A
and B
agree with one another 100%, since it's possible to create a 1-1 mapping (a.k.a. permutation) between labels that will translate A onto B or vice versa. Such a mapping might look like {1:3, 2:1, 3:2}
C
also has some agreement since both A
and B
can be mapped 1-1 into C
using either {1:1, 2:1, 3:2}
for the A->C
mapping, or {3:1, 1:1, 2:2}
for the B->C
mapping. Alternately, chaining mappings from A->B->C
is also possible.
So each labelling schemes partition the set, and in some cases 100% agree with one another, only without agreeing on a common labelling scheme.
D is a slightly differenct case where, because D
contains a higher number of partitions, it is not possible to map to it (but mapping from D down to A or C is possible if necessary)
It is possible to measure the similarity between partitionings using something like sklearn's normalized_mutual_info_score
but what I want to do is identify the best labelling mapping that can be performed to standardise the labelling schemes. Where 'best' is the mapping required in order for all partitionings to share a common set of labels.
In cases where the partitionings divide the set into a fixed "k" number of groups, I've found one method that works based on using the partitionings as a feature-set and k-means clustering over that set of features - but when the number of partitions in a given clustering differs, this doesn't work.
Since normalized_mutual_info_score
is able to "see-through" the differences in labelling, is there a way to identify a mapping that performs the same task, standardising different labellings so they can be more closely analysed?
Upvotes: 0
Views: 55
Reputation: 77474
The Hungarian algorithm is the method of choice to find the best mapping.
It solves a transportation problem. Each cluster is a supplier, and each label is a consumer. It maximizes the flow when matching each supplier to just one consumer.
Upvotes: 1