betto86

Reputation: 714

How to define a custom similarity measure

I need some help defining a custom similarity measure.

I have a dataset whose elements are defined by 4 attributes. As an example, consider the following two items:

Element 1:

A1: "R1", "R3", "R4", "R7"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb"


Element 2:

A1: "R1", "R2"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb", "ccc", "ddd", "eee", "fff"

I have to implement a similarity measure that satisfies the following conditions:

1 - If the A2 value is the same, the two elements must belong to the same cluster.

2 - If two elements have at least one common value in A4, the two elements must belong to the same cluster.

I need to use a sort of weighted Jaccard measure. Is it mathematically correct to define a similarity measure that sums the Jaccard similarity of each attribute and then adds a high weight when conditions 1 and 2 are satisfied for A2 and A4?

If so, how can I transform the similarity matrix into a distance matrix?

Upvotes: 2

Views: 502

Answers (1)

Prune

Reputation: 77827

(1) Distance = 1 - similarity. This is the usual conversion when the similarity is normalized to the [0, 1] range.

(2) Summing the per-attribute distances is valid, although you may wish to scale the sum back to the [0, 1] range, for example by dividing by the number of attributes.

(3) A high weight is not correct for what you've described. If the A2 or A4 values show a match, simply set the distance to 0. The clustering is a requirement, not merely strong advice. Is there some other semantics to your distance function that made you not want to take this route?
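To make points (1) to (3) concrete, here is a minimal sketch in Python. It assumes each element is a dict mapping attribute names to sets of strings; the function names `jaccard_distance` and `element_distance`, the equal weighting of the four attributes, and the third element `e3` are my own assumptions for illustration, not something from your data.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections: 1 - |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty attributes count as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def element_distance(x, y, attrs=("A1", "A2", "A3", "A4")):
    """Distance in [0, 1] between two elements.

    Hard constraints come first: an identical A2 value or any shared
    A4 value forces the distance to 0. Otherwise the per-attribute
    Jaccard distances are averaged (equal weights are an assumption).
    """
    if set(x["A2"]) == set(y["A2"]) or set(x["A4"]) & set(y["A4"]):
        return 0.0
    return sum(jaccard_distance(x[k], y[k]) for k in attrs) / len(attrs)


# The two elements from the question.
e1 = {"A1": {"R1", "R3", "R4", "R7"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb"}}
e2 = {"A1": {"R1", "R2"}, "A2": {"H1"},
      "A3": {"F1", "F2"},
      "A4": {"aaa", "bbb", "ccc", "ddd", "eee", "fff"}}
# Hypothetical third element that matches neither A2 nor A4.
e3 = {"A1": {"R1", "R2"}, "A2": {"H2"},
      "A3": {"F1"}, "A4": {"zzz"}}

d12 = element_distance(e1, e2)  # same A2, so forced to 0
d23 = element_distance(e2, e3)  # no forced match, averaged Jaccard
```

Dividing by the number of attributes is what keeps the result in [0, 1]. If you want unequal weights, replace the plain average with a weighted average whose weights sum to 1.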

FYI, the basic requirements for a metric's distance function D are:

D(a, a) = 0
D(a, b) = D(b, a)
D(a, b) + D(b, c) >= D(a, c)
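If you want to test a candidate distance function against these three properties, a brute-force check over a small sample is usually enough to spot problems. The sketch below is only an illustration: `metric_violations` is a hypothetical helper name, the tolerance is an arbitrary choice, and the sample sets are taken from the attribute values in the question. It is demonstrated on plain Jaccard distance.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def metric_violations(dist, items, tol=1e-12):
    """Return a list of (property, items...) tuples that break the rules."""
    problems = []
    # D(a, a) = 0
    for a in items:
        if abs(dist(a, a)) > tol:
            problems.append(("identity", a))
    # D(a, b) = D(b, a)
    for a, b in product(items, repeat=2):
        if abs(dist(a, b) - dist(b, a)) > tol:
            problems.append(("symmetry", a, b))
    # D(a, b) + D(b, c) >= D(a, c)
    for a, b, c in product(items, repeat=3):
        if dist(a, b) + dist(b, c) < dist(a, c) - tol:
            problems.append(("triangle", a, b, c))
    return problems


samples = [
    frozenset({"R1", "R3", "R4", "R7"}),
    frozenset({"R1", "R2"}),
    frozenset({"F1", "F2"}),
]
violations = metric_violations(jaccard_distance, samples)
```

An empty list only means no violation was found on the sample, not that the function is a metric in general. The check is cubic in the number of items, so keep the sample small.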

Upvotes: 2
