abroekhof
abroekhof

Reputation: 796

Fuzzy string search in Java, including word swaps

I am a Java beginner, trying to write a program that will match an input to a list of predefined strings. I have looked at Levenshtein distance, but I have come to problems such as this:

If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.

Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?

Thanks!

Upvotes: 4

Views: 2617

Answers (3)

Ingo
Ingo

Reputation: 36339

It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".

Upvotes: 1

Anon
Anon

Reputation: 2348

You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).

The basic process is this:

  • Initialization: tokenize your search terms, and store them in a series of HashSets, one per term. Or, if you want to give different weights to each word, use HashMap where the word is the key.
  • Processing: tokenize each input string, and probe each of the sets of search terms to determine how closely they apply to the input. See above for a description of algorithms.

There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).

Upvotes: 2

Nishan
Nishan

Reputation: 2871

Lucene does support fuzzy search based on Levenshtein distance.

https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches

But lucene is meant to search on set of documents rather than string search, so lucene might be an overkill for you. There are other Java implementation available. Take a look at http://www.merriampark.com/ldjava.htm

Upvotes: 1

Related Questions