Reputation: 169
I am trying to understand "Jaccard similarity" between two arrays of type double whose values are greater than zero and less than one.
So far I have searched many websites, and what I found is that both arrays should be the same size (the number of elements in array 1 should equal the number of elements in array 2). But my arrays have different numbers of elements. Is there any way to implement "Jaccard similarity"?
Upvotes: 2
Views: 4450
Reputation: 640
Sorry for necroposting, but the answer above was marked as the correct one. The Jaccard similarity coefficient from @AgapwIesu's answer can be at most 0.5, even when the collections are fully identical (assuming no duplicate values). At the very least, you need to multiply the numerator by 2 to normalize it, like this:
var CommonNumbers = from a in A.AsEnumerable<double>()
                    join b in B.AsEnumerable<double>() on a equals b
                    select a;
double JaccardIndex = 2 * (((double) CommonNumbers.Count()) /
                           ((double) (A.Count() + B.Count())));
Please note that this similarity coefficient is not the intersection divided by the union, as defined on Wikipedia. If you want the intersection divided by the union using LINQ, you can try this code:
private static double JaccardIndex(IEnumerable<double> A, IEnumerable<double> B)
{
    return (double)A.Intersect(B).Count() / (double)A.Union(B).Count();
}
Keep in mind that Union and Intersect work with unique objects, so be careful with collections that contain duplicates:
List<int> A = new List<int>() { 1, 1, 1, 1 };
List<int> B = new List<int>() { 1, 1, 1, 1 };
Console.WriteLine(A.Union(B).Count()); // = 1, not 4
Console.WriteLine(A.Intersect(B).Count()); // = 1, not 4
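To tie this back to the original question (arrays of different lengths), here is a minimal sketch that treats both inputs as sets, so duplicates collapse exactly as in the Union/Intersect output above. The class name JaccardSketch, the sample arrays, and the choice to return 1.0 for two empty inputs are assumptions made for illustration. It compares doubles with exact equality, which may not be what you want for computed values.

```csharp
using System;
using System.Collections.Generic;

public static class JaccardSketch
{
    // Hypothetical helper: set-based Jaccard index; inputs may differ in length.
    public static double JaccardIndex(IEnumerable<double> a, IEnumerable<double> b)
    {
        var setA = new HashSet<double>(a);   // duplicates collapse here
        var setB = new HashSet<double>(b);

        var intersection = new HashSet<double>(setA);
        intersection.IntersectWith(setB);    // keep only values present in both

        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int unionCount = setA.Count + setB.Count - intersection.Count;

        // Assumption: two empty inputs are treated as identical.
        if (unionCount == 0) return 1.0;

        return (double)intersection.Count / unionCount;
    }

    public static void Main()
    {
        double[] x = { 0.1, 0.2, 0.3 };
        double[] y = { 0.2, 0.3, 0.4, 0.5 };

        // 2 shared values out of 5 distinct values
        Console.WriteLine(JaccardIndex(x, y));
    }
}
```

If duplicates should count (a multiset interpretation), this sketch does not apply as written, because HashSet discards them.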
Upvotes: 3
Reputation:
Using C#'s LINQ ...
Say you have an array of doubles named A and another named B. This will give you the Jaccard index:
var CommonNumbers = from a in A.AsEnumerable<double>()
                    join b in B.AsEnumerable<double>() on a equals b
                    select a;
double JaccardIndex = (((double) CommonNumbers.Count()) /
                       ((double) (A.Count() + B.Count())));
The first statement gets the numbers that appear in both arrays. The second computes the index: the size of the intersection (how many numbers appear in both arrays) divided by the size of the union (taken here as the count of one array plus the count of the other).
Upvotes: 4
Reputation:
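Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The arrays do not need to be the same length. In your case, you'd have to write code to count how many elements appear in both arrays, then divide that by the number of distinct elements across both arrays (for arrays without duplicates, the sum of the two sizes minus the number of shared elements).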
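Written out, with a small worked example. The sets A and B below are made up for illustration and are not taken from the question:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

% Example with sets of different sizes:
% A = \{0.1, 0.2, 0.3\}, \quad B = \{0.2, 0.3, 0.4, 0.5\}
% A \cap B = \{0.2, 0.3\}, \quad |A \cap B| = 2
% |A \cup B| = 3 + 4 - 2 = 5
J(A, B) = \frac{2}{5} = 0.4
```

Dividing by |A| + |B| alone counts the shared elements twice, which is why two identical sets would then score 0.5 instead of 1.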
Upvotes: 2