Apoorv

Reputation: 179

Jaccard similarity in Python

I am trying to find the Jaccard similarity between two documents. However, I am having a hard time understanding how the function sklearn.metrics.jaccard_similarity_score() works behind the scenes. As I understand it, Jaccard similarity = (intersection of the terms in the docs) / (union of the terms in the docs).

Consider the example below. My DTM (document-term matrix) for the two documents is:

array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)

The function above gives me this Jaccard similarity score:

print(sklearn.metrics.jaccard_similarity_score(tf_matrix[0,:],tf_matrix[1,:]))
0.25

I am trying to compute the score on my own as:

intersection of terms in both the docs = 4
total distinct terms in doc 1 = 6
total distinct terms in doc 2 = 6
Jaccard = 4/(6+6-4) = 0.5
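
A minimal sketch of this hand calculation in plain Python, assuming a term counts as present in a document whenever its count is greater than zero. The function name set_jaccard is hypothetical, not part of scikit-learn:

```python
# Term counts for the two documents (the two rows of the DTM above).
doc1 = [1, 1, 1, 1, 2, 0, 1, 0]
doc2 = [2, 1, 1, 0, 1, 1, 0, 1]


def set_jaccard(a, b):
    """Jaccard similarity over the sets of terms present (count > 0)."""
    terms_a = {i for i, count in enumerate(a) if count > 0}
    terms_b = {i for i, count in enumerate(b) if count > 0}
    union = terms_a | terms_b
    if not union:
        # Assumption: two documents with no terms are treated as identical.
        return 1.0
    return len(terms_a & terms_b) / len(union)


print(set_jaccard(doc1, doc2))  # 0.5
```

Here the union has 8 terms and the intersection has 4, which matches 4/(6+6-4).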

Can someone please help me understand if there is something obvious I am missing here?

Upvotes: 2

Views: 11115

Answers (2)

Kunam

Reputation: 78

In my view, the function counts

intersection of terms in both the docs = 2.

It compares the two vectors element by element at their respective indices, the way a model's predictions are checked against the correct values. Only indices 1 and 2 hold equal values.

Normal intersection = 4, ignoring the index order.

# so,
jaccard_score = 2/(6+6-4) = 2/8 = 0.25

Upvotes: 0

enezhadian

Reputation: 946

As stated here:

In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.

Therefore, in your example it is calculating the proportion of matching elements. Only 2 of the 8 positions hold equal values, so you get 2/8 = 0.25 as the result.
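
To illustrate, here is a rough sketch that reproduces the 0.25 in plain Python by counting positions where the two rows agree. It only mirrors the accuracy-style behaviour described above and is not scikit-learn's actual implementation; the function name elementwise_accuracy is hypothetical:

```python
# The two rows of the DTM from the question.
doc1 = [1, 1, 1, 1, 2, 0, 1, 0]
doc2 = [2, 1, 1, 0, 1, 1, 0, 1]


def elementwise_accuracy(y_true, y_pred):
    """Fraction of positions where the two sequences hold the same value."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


print(elementwise_accuracy(doc1, doc2))  # 0.25
```

The rows agree only at indices 1 and 2, so the raw counts are treated as class labels rather than as term sets.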

Upvotes: 2
