Reputation: 714
I need some help defining a custom similarity measure.
I have a dataset whose elements are defined by 4 attributes. As an example, consider the following two items:
Element 1:
A1: "R1", "R3", "R4", "R7"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb"
Element 2:
A1: "R1", "R2"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb", "ccc", "ddd", "eee", "fff"
I have to implement a similarity measure that satisfies the following conditions:
1 - If the A2 value is the same, the two elements must belong to the same cluster.
2 - If two elements have at least one common value on A4, the two elements must belong to the same cluster.
I need to use a sort of weighted Jaccard measure. Is it mathematically correct to define a similarity measure that sums the Jaccard distance of each attribute and then adds a high weight if conditions 1 and 2 are satisfied for A2 and A4?
If so, how can I transform the similarity matrix into a distance matrix?
Upvotes: 2
Views: 502
Reputation: 77827
(1) Distance = 1 - similarity. This is the usual conversion, provided the similarity is scaled to the [0, 1] range.
(2) Summing the distances of the attributes is valid, although you may wish to scale the sum back to the [0, 1] range, for example by dividing by the number of attributes.
(3) Putting a high weight is not correct for what you've described. If the A2 or A4 values show a match, simply set the distance to 0. The clustering is a requirement, not merely strong advice. Is there some other meaning attached to your distance function that kept you from taking this route?
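As a rough illustration of points (1) to (3), here is a minimal Python sketch. It assumes each element is stored as a dict of sets keyed by attribute name, and that the per-attribute distances are averaged to stay in [0, 1]. The names `jaccard_distance`, `element_distance` and the third element `e3` are made up for this example and are not part of the question.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def element_distance(x, y):
    """Distance between two elements given as dicts of sets (assumed layout)."""
    # Hard requirements from the question: same A2, or any shared A4 value,
    # forces the pair together, so the distance is set to 0 outright.
    if x["A2"] == y["A2"] or (x["A4"] & y["A4"]):
        return 0.0
    attrs = ("A1", "A2", "A3", "A4")
    # Otherwise average the per-attribute distances to stay within [0, 1].
    return sum(jaccard_distance(x[k], y[k]) for k in attrs) / len(attrs)


e1 = {"A1": {"R1", "R3", "R4", "R7"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb"}}
e2 = {"A1": {"R1", "R2"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb", "ccc", "ddd", "eee", "fff"}}
# Hypothetical third element that matches neither A2 nor A4 of e1.
e3 = {"A1": {"R1"}, "A2": {"H2"}, "A3": {"F1", "F2"}, "A4": {"zzz"}}

print(element_distance(e1, e2))  # A2 matches, so the override applies
print(element_distance(e1, e3))  # falls through to the averaged Jaccard distance
```

The two elements from the question share A2 (and A4 values), so the override fires for them. The averaged distance is only used for pairs that meet neither condition.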
FYI, the basic properties required of a metric's distance function D are:
D(a, a) = 0
D(a, b) = D(b, a)
D(a, b) + D(b, c) >= D(a, c)
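If it helps, the three properties above can be checked by brute force on a small sample. The sketch below is only an illustration: `metric_violations` and `overlap_zero` are hypothetical helpers, and the sample sets are made up. It also suggests one thing to watch for: forcing the distance to 0 whenever two sets overlap can break the triangle inequality when a matches b and b matches c but a and c share nothing.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance between two sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def overlap_zero(a, b):
    """Hypothetical variant: distance 0 as soon as the sets share a value."""
    return 0.0 if a & b else jaccard_distance(a, b)


def metric_violations(points, dist, tol=1e-12):
    """List every (property, points) combination where dist fails a property."""
    violations = []
    for a in points:
        if abs(dist(a, a)) > tol:
            violations.append(("identity", (a,)))
    for a, b in product(points, repeat=2):
        if abs(dist(a, b) - dist(b, a)) > tol:
            violations.append(("symmetry", (a, b)))
    for a, b, c in product(points, repeat=3):
        if dist(a, b) + dist(b, c) < dist(a, c) - tol:
            violations.append(("triangle", (a, b, c)))
    return violations


points = [
    frozenset({"R1", "R3", "R4", "R7"}),
    frozenset({"R1", "R2"}),
    frozenset({"R2", "R5"}),
]

plain = metric_violations(points, jaccard_distance)
forced = metric_violations(points, overlap_zero)
print(len(plain), len(forced))
```

On this sample the plain Jaccard distance passes all three checks, while the forced-zero variant fails the triangle check. Whether that matters depends on the clustering algorithm: some only need a dissimilarity matrix rather than a true metric.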
Upvotes: 2