Reputation: 71
I'm looking for a set similarity measure (like the Jaccard index) that uses known similarities between the items in the sets and weights the connections by item abundance. The known similarities are scores between 0 and 1, where 1 indicates an exact match.
For example, consider two sets:
SET1 {A,B,C} and SET2 {A',B',C'}
I know that
{A,A'}, {B,B'}, {C,C'} each have an item similarity of 0.9. Hence, I would expect the similarity of SET1 and SET2 to be relatively high.
As another example, consider two sets SET1 {A,B,C} and SET2 {A,B',C',D,E,F,.....,Z}. Although the first three items match at least as well as in the first example, the score should be lower because of the size difference (as with Jaccard).
A further issue is how to use the abundances as weights; I have no idea how to approach this.
In short, I need a normalized set similarity measure that takes both item similarity and item abundance into account.
Upvotes: 0
Views: 89
Reputation: 988
Correct me if I'm wrong, but I think what you need is the clustering error as a similarity measure. It is the proportion of points that are clustered differently in A' and A after an optimal matching of clusters. In other words, it is the scaled sum of the off-diagonal elements of the confusion matrix, minimized over all permutations of rows and columns. The optimal matching can be found with the Hungarian algorithm, which avoids the cost of trying every permutation, and the measure penalizes sets with different numbers of elements.
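To make the definition concrete, here is a minimal sketch of the clustering error for two labelings of the same points. The function name `clustering_error` and the label-list input format are my own assumptions, not from any library. For readability it tries every one-to-one matching of clusters by brute force, which is only practical for a handful of clusters; it does not implement the Hungarian algorithm itself.

```python
from collections import Counter
from itertools import permutations


def clustering_error(labels_a, labels_b):
    """Fraction of points clustered differently after the best cluster matching.

    labels_a[i] and labels_b[i] are the cluster labels of point i in the two
    clusterings (hypothetical input format, chosen for this sketch).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n == 0:
        return 0.0

    # Confusion counts: number of points with label a in one clustering
    # and label b in the other.
    confusion = Counter(zip(labels_a, labels_b))
    clusters_a = list(set(labels_a))
    clusters_b = list(set(labels_b))

    # Put the clustering with fewer clusters first, so that partial
    # permutations of the larger side enumerate every one-to-one matching.
    if len(clusters_a) > len(clusters_b):
        clusters_a, clusters_b = clusters_b, clusters_a
        confusion = Counter({(b, a): c for (a, b), c in confusion.items()})

    # Brute force: keep the matching with the most agreeing points.
    # Clusters left unmatched contribute only disagreements.
    best_agreement = max(
        sum(confusion[(a, b)] for a, b in zip(clusters_a, perm))
        for perm in permutations(clusters_b, len(clusters_a))
    )
    return 1.0 - best_agreement / n


print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
print(clustering_error([0, 0, 1, 1], [0, 0, 0, 1]))  # one point moved
```

For more than a few clusters, the brute-force `max` can be replaced by a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` applied to the confusion matrix with `maximize=True`. Note that this measure compares two labelings of the same points, so the item similarities and abundances from the question would still have to be mapped onto that setting.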
Upvotes: 1