Camilla8

Reputation: 171

Jaccard Index Python

I want to use the Jaccard index to find the similarity between two sets. I found a Jaccard index implementation here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html but the library function takes lists as input, while in my case I would prefer to use sets.

I wrote this code:

from sklearn.metrics import jaccard_similarity_score



def jaccard_index(first_set, second_set):
    """ Computes jaccard index of two sets
        Arguments:
          first_set(set):
          second_set(set):
        Returns:
          index(float): Jaccard index between two sets; it is
            between 0.0 and 1.0
    """
    # If both sets are empty, jaccard index is defined to be 1
    index = 1.0
    if first_set or second_set:
        index = (float(len(first_set.intersection(second_set)))
             / len(first_set.union(second_set)))

    return index

y_pred = [0, 2, 1, 3, 5]
y_true = [0, 1, 2, 3, 7]
a = {0, 2, 1, 3, 5}
b = {0, 1, 2, 3, 7}
print(jaccard_similarity_score(y_true, y_pred))
print(jaccard_similarity_score(y_true, y_pred, normalize=False))
print(jaccard_index(a, b))

These are the outputs of the three print calls:

0.4
2
0.666666666667

Why are they different from my implementation (0.666666666667)? Why is the second result 2? Shouldn't the Jaccard Index be between 0 and 1? Which one is the best implementation and which one should I use?

Upvotes: 1

Views: 5221

Answers (1)

Andrey Lukyanenko

Reputation: 3851

From the documentation:

If normalize == True, return the average Jaccard similarity coefficient,
else it returns the sum of the Jaccard similarity coefficient over the sample set.
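To make the quoted sentence concrete, here is a minimal pure-Python sketch; it is not sklearn's actual code. It assumes that, for plain label lists like the ones in the question, each position counts as one sample that scores 1 when the two labels match and 0 otherwise. The helper name per_sample_scores is made up for illustration.

```python
def per_sample_scores(y_true, y_pred):
    # Assumption: one sample per position, scored 1.0 on a match, 0.0 otherwise.
    return [1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred)]

y_pred = [0, 2, 1, 3, 5]
y_true = [0, 1, 2, 3, 7]

scores = per_sample_scores(y_true, y_pred)
total = sum(scores)            # corresponds to normalize=False (a sum, not a ratio)
average = total / len(scores)  # corresponds to normalize=True (the average)

print(average)  # 0.4
print(total)    # 2.0
```

Under that assumption the sum is a count of matching positions, which is why it can be larger than 1, while the average stays between 0 and 1.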

By the way, you can see the code of sklearn implementation here

__

I now see the main problem: it is due to the nature of sets. You have the line a={0,2,1,3,5}. A set does not keep the order in which its elements were written; here a ends up displayed as {0, 1, 2, 3, 5}, which looks sorted. The same happens to b independently, so the positional information of the original lists is lost. The sklearn function compares y_true and y_pred position by position, so its result is not a similarity between the sets a and b but a score over the ordered lists. So you can't use sets with it, because the original position of the elements matters.
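A short sketch of what the set-based computation sees, using only built-in Python sets (no sklearn). The variable name set_jaccard is just illustrative.

```python
a = set([0, 2, 1, 3, 5])
b = set([0, 1, 2, 3, 7])

# Order of construction is irrelevant: these are the same set.
print(a == set([5, 3, 2, 1, 0]))  # True

# Set-based Jaccard index: size of the intersection over size of the union.
shared = a & b   # elements in both sets: 0, 1, 2, 3
combined = a | b # elements in either set: 0, 1, 2, 3, 5, 7
set_jaccard = len(shared) / float(len(combined))

print(len(shared), len(combined))  # 4 6
print(set_jaccard)                 # 4/6, roughly 0.667
```

This is the quantity the jaccard_index function in the question computes, and it does not depend on where each element appeared in the original lists.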

Upvotes: 2
