Reputation: 13716
Can one use scikit-learn (or another well-known python package) to get the Jaccard Similarity between a pair of sets?
I am only seeing the sklearn jaccard_similarity_score function working on vectors/arrays/tensors of equal length, whereas I really need the intersection-over-union calculation, which is a set calculation, not a computation over two same-sized tensors.
Maybe I should be using MultiLabelBinarizer, exemplified here, if that's the intended way afforded by the scikit-learn API.
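For concreteness, here is a rough sketch of what I imagine that would look like, assuming MultiLabelBinarizer plus jaccard_score (the replacement for jaccard_similarity_score in newer scikit-learn versions) is the intended route:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import jaccard_score

gold = {'A', 'B', 'C'}
clf = {'A', 'D'}

# Binarize both sets against their combined label vocabulary so they
# become equal-length 0/1 indicator vectors.
mlb = MultiLabelBinarizer()
y_gold, y_clf = mlb.fit_transform([gold, clf])

# On binary indicator vectors, jaccard_score is intersection-over-union
# of the positive entries: here 1 / 4 = 0.25.
print(jaccard_score(y_gold, y_clf))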
Of course, it's only a few lines of code to implement myself with no package...
*This question is not a homework assignment; the slide is one I made for a non-technical audience at one time, and it illustrates the point here.
Just wondering.
Upvotes: 3
Views: 2740
Reputation: 16079
NumPy has some set routines built in. In this case, as @Harpal pointed out, you could use the intersect and union operations.
In pure Python, using intersection and union:
gold = ['A', 'B', 'C']
clf = ['A', 'D']
gold_s = set(gold)
clf_s = set(clf)
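# Jaccard similarity: size of the intersection divided by the size of the union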
jac_sim = len(gold_s.intersection(clf_s)) / len(gold_s.union(clf_s))
jac_sim
0.25
In NumPy, using intersect1d and union1d:
import numpy as np

gold = np.array(gold)
clf = np.array(clf)
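# Same intersection-over-union, computed with NumPy's set routines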
jac_sim = np.intersect1d(gold, clf).size / np.union1d(gold, clf).size
jac_sim
0.25
Granted, the NumPy implementation is a good bit slower, but if your data is already in a NumPy array it may be faster than converting it to a set and doing the computation in pure Python. It all depends on the size of your data.
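If performance matters for your case, a quick timeit comparison is the simplest way to check; this is just a sketch with made-up toy inputs, and the crossover point will depend on your machine and data size:

import timeit
import numpy as np

# Toy inputs for illustration only.
gold = list('ABCDEFGHIJ')
clf = list('AFGXYZ')
gold_arr = np.array(gold)
clf_arr = np.array(clf)

def jaccard_sets():
    g, c = set(gold), set(clf)
    return len(g & c) / len(g | c)

def jaccard_numpy():
    return np.intersect1d(gold_arr, clf_arr).size / np.union1d(gold_arr, clf_arr).size

print("sets :", timeit.timeit(jaccard_sets, number=10_000))
print("numpy:", timeit.timeit(jaccard_numpy, number=10_000))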
Upvotes: 5