Reputation: 3181
I need to calculate the similarity of a query and a document in Lucene using Jaccard similarity over n-grams. As Jaccard similarity is a very common measure in IR, I expected to find a Lucene implementation of it, but I couldn't.
Is anyone aware of such an implementation?
Upvotes: 3
Views: 2676
Reputation: 5354
The only implementation I'm aware of that can be easily integrated with Lucene is the one from LingPipe (please note that it's free only for non-commercial/research use). Here is a blog post showing how to use it in LingPipe. A detailed explanation of how to connect the two libraries is available on the LingPipe website and in this book.
I haven't evaluated, however, whether it would be easier (also from a licensing point of view) to integrate some other implementation on your own; this is just a solution that worked for me.
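For comparison, here is a rough idea of what rolling your own could look like. This is only a minimal sketch in plain Java, not LingPipe's or Lucene's API: the class and method names (`NGramJaccard`, `ngrams`, `jaccard`, `similarity`) are made up for illustration, it assumes character n-grams rather than word n-grams, and it treats two empty n-gram sets as fully similar, which is a convention I picked rather than anything standard.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; not part of Lucene or LingPipe.
final class NGramJaccard {

    // Collect the distinct character n-grams of a string.
    // Strings shorter than n yield an empty set.
    static Set<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty sets count as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Convenience wrapper: compare a query and a document over character n-grams.
    static double similarity(String query, String document, int n) {
        return jaccard(ngrams(query, n), ngrams(document, n));
    }
}
```

For example, `NGramJaccard.similarity("night", "nacht", 2)` compares the bigram sets {ni, ig, gh, ht} and {na, ac, ch, ht}, which share one bigram out of seven distinct ones. Inside Lucene you would probably want to take the n-grams from your analyzer chain (for instance an `NGramTokenizer`) instead of `substring`, so that the query and the indexed text are tokenized the same way.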
Upvotes: 2
Reputation: 601
Try the SimMetrics library (http://sourceforge.net/projects/simmetrics/), which provides many more similarity functions. However, I would recommend SoftTFIDF from SecondString (http://secondstring.sourceforge.net/): it has the best precision/recall according to "A Comparison of String Distance Metrics for Name-Matching Tasks" by William W. Cohen and others.
Upvotes: 1