Reputation: 6093
I have problems in finding a proper similarity measure for clustering. I have around 3000 arrays of sets, where each set contains features of certain domain (e.g., number, color, days, alphabets, etc). I'll explain my problem with an example.
Lets assume i have only 2 arrays(a1 & a2) and I want to find the similarity between them. each array contains 4 sets (in my actual problem there are 250 sets (domains) per array) and a set can be empty.
a1: {a,b}, {1,4,6}, {mon, tue, wed}, {red, blue,green}
a2: {b,c}, {2,4,6}, {}, {blue, black}
I have come with a similarity measure using Jaccard index (denoted as J):
sim(a1,a2) = [J(a1[0], a2[0]) + J(a1[1], a2[1]) + ... + J(a1[3], a2[3])]/4
note:I divide by total number of sets (in the above example 4) to keep the similarity between 0 and 1.
Is this a proper similarity measure and are there any flaws in this approach
. I am applying Jaccard index for each set separately because I want compare the similarity between related domains(i.e. color with color, etc...)
I am not aware of any other proper similarity measure for my problem.
Further, can I use this similarity measure for clustering purpose?
Upvotes: 0
Views: 613
Reputation: 77454
This should work for most clustering algorithms. Don't use k-means - it can handle numeric vector spaces only. But you have a vector-of-sets type of data.
You may want to use a different mean than the arithmetic average for combining the four Jaccard measures. Try the harmonic or geometric means. See, the average over 250 values will likely be somewhere close to 0.5 all the time, so you need a mean that is more "aggressive".
So the plan sounds good. Just try it, implement this similarity and plug it into various clustering algorithm and see if they find something. I like OPTICS for exploring data and distance functions, as the OPTICS plot can be very indicative whether (or not!) there is something to be found based on the distance function. If the plot is too flat, there just is not much to be found, it is like a representative sample of the distances in the data set...
I use ELKI, and they even have a tutorial on adding custom distance functions: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions although you can probably just compute the distances with whatever tool you like and write them to a similarity matrix. At 3000 objects this will remain very manageable, 4200000 doubles is just a few MB.
Upvotes: 1