Reputation: 297
I would like for Lucene to find a document containing a term "bahnhofstr" if I search for "bahnhofstrasse", i.e., I don't only want to find documents containing terms of which my search term is a prefix but also documents that contain terms that are themselves a prefix of my search term...
How would I go about this?
Upvotes: 7
Views: 188
Reputation: 5703
If I understand you correctly, and your search string is an exact string, you can set queryParser.setAllowLeadingWildcard(true);
in Lucene to allow for leading-wildcard searches (which may or may not be slow -- I have seen them reasonably fast but in a case where there were only 60,000+ Lucene documents).
Your example query syntax could look something like:
*bahnhofstr bahnhofstr*
or possibly (have not tested this) just:
*bahnhofstr*
Upvotes: 1
Reputation: 33351
I think a fuzzy query might be most helpful for you. This will score terms based on the Levenshtein distance from your query. Without a minimum similarity specified, it will effectively match every term available. This can make it less than performant, but does accomplish what you are looking for.
A fuzzy query is signalled by the ~ character, such as:
firstname:bahnhofstr~
Or with a minimum similarity (a number between 0 and 1, 0 being the loosest with no minimum)
firstname:bahnhofstr~0.4
Or if you are constructing your own queries, use the FuzzyQuery
This isn't quite Exactly what you specified, but is the easiest way to get close.
As far as exactly what you are looking for, I don't know of a simple Lucene call to accomplish it. I would probably just split the term into a series of termqueries, that you could represent in a query string something like:
firstname:b
firstname:ba
firstname:bah
firstname:bahn
firstname:bahnh
firstname:bahnho
firstname:bahnhof
firstname:bahnhofs
firstname:bahnhofst
firstname:bahnhofstr*
I wouldn't actually generate a query string for it myself, by the way. I'd just construct the TermQuery and PrefixQuery objects myself.
Scoring would be bit warped, and I'dd probably boost longer queries more highly to get better ordering out of it, but that's the method that comes to mind to accomplish exactly what you're looking for fairly easily. A DisjunctionMaxQuery would help you use something like this with other terms and acquire more reasonable scoring.
Hopefully a fuzzy query works well for you though. Seems a much nicer solution.
Another option, if you have a lot of need for queries of this nature, might be, when indexing, tokenize fields into n-grams (see NGramTokenizer), which would allow you to effectively use an NGramPhraseQuery to achieve the results you want.
Upvotes: 0