matanox

Reputation: 13716

Python package function for Jaccard Similarity between sets?

Can one use scikit-learn (or another well-known python package) to get the Jaccard Similarity between a pair of sets?

I am only seeing the sklearn jaccard_similarity_score function working on vectors/arrays/tensors of equal length, whereas I really do need the intersection-over-union calculation, which is a set calculation, not a computation over two same-sized tensors.

Maybe I should be using the MultiLabelBinarizer, exemplified here, if that's the intended way afforded by the scikit-learn API.

Of course, it's only a few lines of code to implement myself with no package...

[Image: slide] *This question is not a homework assignment; I made the slide for a non-technical audience some time ago, and it illustrates the point here.

Just wondering.

Upvotes: 3

Views: 2740

Answers (1)

Grr

Reputation: 16079

NumPy has some set routines built in. In this case, as @Harpal pointed out, you could use the intersection and union operations.

In pure Python, using intersection and union:

gold = ['A', 'B', 'C']
clf = ['A', 'D']
gold_s = set(gold)
clf_s = set(clf)
jac_sim = len(gold_s.intersection(clf_s)) / len(gold_s.union(clf_s))
jac_sim
0.25

In NumPy using intersect1d and union1d :

import numpy as np

gold = np.array(gold)
clf = np.array(clf)
jac_sim = np.intersect1d(gold, clf).size / np.union1d(gold, clf).size
jac_sim
0.25

Granted, the NumPy implementation is a good bit slower, but if your data is already in a NumPy array, it may be faster than converting it to a set and doing the computation in pure Python. It all depends on the size of your data.
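If you need this in more than one place, the pure-Python version can be wrapped in a small helper. This is only a sketch: the name jaccard_similarity is made up for illustration, and returning 1.0 when both inputs are empty is an assumption on my part (the ratio is undefined there, so pick whatever convention suits your use case).

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical (similarity 1.0).
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard_similarity(['A', 'B', 'C'], ['A', 'D']))  # 0.25
```

Because the inputs are converted with set(), duplicates in a list are ignored, which matches the set-based definition rather than an element-wise comparison of equal-length vectors.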

Upvotes: 5
