Reputation: 1192
I am using Lucene fuzzy queries to search an input string/document for a first name and a last name, each with a Levenshtein distance of up to 2.
I also expect these two terms to be in close proximity to each other (in this example, separated by at most 3 terms), which I enforce with a SpanNearQuery. The order of the two terms should not matter.
My code for building the query:
// Fuzzy match on each name, allowing up to 2 edits per term
FuzzyQuery firstNameQuery = new FuzzyQuery(new Term("text", firstName), 2);
FuzzyQuery lastNameQuery = new FuzzyQuery(new Term("text", lastName), 2);
// Wrap the fuzzy queries so they can be used as span clauses
SpanQuery[] clauses = new SpanQuery[] {
    new SpanMultiTermQueryWrapper<MultiTermQuery>(firstNameQuery),
    new SpanMultiTermQueryWrapper<MultiTermQuery>(lastNameQuery)
};
// slop = 3, inOrder = false
SpanNearQuery spanNearQuery = new SpanNearQuery(clauses, 3, false);
What I am seeing in my unit test is that terms with a Levenshtein distance of 1 seem to work, so "John Doa", "Jon Dox", etc. will match "John Doe", but a Levenshtein distance of 2 will not, e.g. "Johnnie Doe" will not match.
The span length is working fine: I can have up to 3 terms between the first and last names.
Can someone enlighten me as to what I am doing wrong?
Update 1
Sorry, the example above was contrived (I could not use the real data for privacy reasons) and I got it wrong.
What I am actually seeing is that the query does not work the way I expected at all.
Input String: "Patient: John Doe"
Query: spanNear([SpanMultiTermQueryWrapper(text:John~2), SpanMultiTermQueryWrapper(text:Doe~2)], 3, false)
This does not generate a hit, even though both terms should match exactly (edit distance of 0).
Upvotes: 1
Views: 1188
Reputation: 1131
Lucene 4.x fuzzy queries match terms within an edit distance of at most 2, and your use case needs more than that: "John" and "Johnnie" are at distance 3.
In my opinion, Lucene's built-in fuzzy matching is not well suited to name matching: it will not work properly for long names (it allows at most 2 edits per term), and it is slower because it uses a finite-state automaton to find the best possible matches.
The best and fastest way is to use an n-gram approach for fuzzy matches (trigram fuzzy matching is common).
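To check the distance claim by hand, here is a minimal plain-Java sketch of the classic Levenshtein distance. The class and method names (`EditDistance`, `levenshtein`) are made up for illustration and are not part of Lucene. It counts only insertions, deletions, and substitutions, so it may differ from Lucene's own distance for inputs where transpositions matter.

```java
// Hypothetical helper, not a Lucene API: plain Levenshtein distance
// (insert, delete, substitute), computed with two rolling rows.
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows for the next iteration
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Calling `EditDistance.levenshtein("john", "johnnie")` gives 3 (three insertions), which is outside the limit of 2, while "john" vs. "jon" and "doe" vs. "dox" are each at distance 1.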
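As a rough illustration of the trigram idea, here is a standalone Java sketch that scores two names by the Jaccard overlap of their character trigrams. The names `TrigramSimilarity`, `trigrams`, and `jaccard` are hypothetical, and the choice of Jaccard as the score is an assumption on my part. In Lucene itself you would normally produce the n-grams at index time through the analysis chain; this only shows the principle.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Lucene API: character-trigram similarity.
final class TrigramSimilarity {
    // Returns the set of lowercased 3-character substrings of s.
    // Strings shorter than 3 characters yield a single gram: the string itself.
    static Set<String> trigrams(String s) {
        String t = s.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        if (t.length() < 3) {
            grams.add(t);
            return grams;
        }
        for (int i = 0; i + 3 <= t.length(); i++) {
            grams.add(t.substring(i, i + 3));
        }
        return grams;
    }

    // Jaccard similarity: |intersection| / |union| of the two trigram sets.
    static double jaccard(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        Set<String> intersection = new HashSet<>(ga);
        intersection.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return (double) intersection.size() / union.size();
    }
}
```

With this sketch, "John" and "Johnnie" share the trigrams "joh" and "ohn" out of five distinct trigrams, so they still get a non-zero score (0.4) even though their edit distance is 3. Where to put the acceptance threshold is something you would have to tune on your own data.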
Update: It looks like you might have an issue with upper case versus lower case. In my understanding, Lucene does not run the query analyzer for fuzzy searches, so the term is matched exactly as you pass it in.
Can you try with "john" and "doe" (both lowercase) as your first name and last name, and let me know if it works?
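If that turns out to be the cause, one way to avoid it is to normalize the names yourself before building the terms. The helper below is a hypothetical sketch (`NameTerms.normalize` is not a Lucene method), and it assumes the "text" field was indexed with an analyzer that lowercases tokens.

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene API: mimics the lowercasing that the
// index-time analyzer is assumed to apply, since a programmatically built
// FuzzyQuery uses its term text as-is.
final class NameTerms {
    static String normalize(String name) {
        // Locale.ROOT avoids locale-specific surprises (e.g. Turkish dotless i)
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

You would then build the queries from the normalized values, e.g. `new FuzzyQuery(new Term("text", NameTerms.normalize(firstName)), 2)`.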
Upvotes: 1