Proper similarity measure for clustering

Question

I have problems in finding a proper similarity measure for clustering. I have around 3000 arrays of sets, where each set contains features of certain domain (e.g., number, color, days, alphabets, etc). I'll explain my problem with an example.

Lets assume i have only 2 arrays(a1 & a2) and I want to find the similarity between them. each array contains 4 sets (in my actual problem there are 250 sets (domains) per array) and a set can be empty.

a1: {a,b}, {1,4,6}, {mon, tue, wed}, {red, blue,green}
a2: {b,c}, {2,4,6}, {}, {blue, black}

I have come with a similarity measure using Jaccard index (denoted as J):

sim(a1,a2) = [J(a1[0], a2[0]) + J(a1[1], a2[1]) + ... + J(a1[3], a2[3])]/4

note:I divide by total number of sets (in the above example 4) to keep the similarity between 0 and 1.

Is this a proper similarity measure and are there any flaws in this approach. I am applying Jaccard index for each set separately because I want compare the similarity between related domains(i.e. color with color, etc...)

I am not aware of any other proper similarity measure for my problem. Further, can I use this similarity measure for clustering purpose?

Proper similarity measure for clustering

Answers (1)

Related Questions