Reputation: 796
I am a Java beginner, trying to write a program that will match an input to a list of predefined strings. I have looked at Levenshtein distance, but I have come to problems such as this:
If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.
Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?
Thanks!
Upvotes: 4
Views: 2617
Reputation: 36339
It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".
Upvotes: 1
Reputation: 2348
You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).
The basic process is this:
HashSet
s, one per term. Or, if you want to give different weights to each word, use HashMap
where the word is the key.There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).
Upvotes: 2
Reputation: 2871
Lucene does support fuzzy search based on Levenshtein distance.
https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches
But lucene is meant to search on set of documents rather than string search, so lucene might be an overkill for you. There are other Java implementation available. Take a look at http://www.merriampark.com/ldjava.htm
Upvotes: 1