Reputation: 4960
I have a few thousand strings that look something like this
I want to provide a search that works so that
etc. All return the first string (and possibly others), but
do not.
I.e. both exact prefix match and token (potentially prefix) matches in the right order should match, independent of case, but all tokens in the query must be in the document.
I've set up the typical Lucene example using a StandardAnalyzer and a common QueryParser with a prefix query.
I figure I might need a BooleanQuery to state that all tokens in the query must be in the documents, but I can't quite figure out how to get at the tokens to build it (the query is user-supplied). I also realize that using a StringField instead of a TextField gives me exact string matches as opposed to token-wise matches, but I'm not sure if that's something I can combine with the above?
How should I go about this? I don't even have to use Lucene to do it, but it looked like a good fit.
Upvotes: 0
Views: 1278
Reputation: 33351
Your first one (non-critical) is a little trickier, but to make sure all terms in the query are found in all results, you can simply make all of the query terms required. You can do this either by adding plus operators:
+foo +bar
+foo +ba*
(If you want to handle prefixes, you'll need to add the wildcards to specify it, or possibly use an ngram tokenizer, or some such.)
Or, you can just set the default operator to AND, using StandardQueryParser.setDefaultOperator:
queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
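For a user-supplied query, the "+foo +ba*" form can be produced with plain string handling before the text is handed to the parser. The following is only a minimal sketch: the class and method names are hypothetical, the split on non-alphanumeric characters is an assumption that only roughly approximates StandardAnalyzer, and the lowercasing assumes the index was built with a lowercasing analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: turns raw user input into a
// query string in which every token is required and treated as a prefix.
class RequiredPrefixQueryBuilder {

    // Lowercases the input and splits on anything that is not a letter or
    // digit. This is an assumed, simplified tokenization; it also removes
    // characters that have special meaning in the query syntax.
    static List<String> tokens(String userInput) {
        List<String> result = new ArrayList<>();
        for (String part : userInput.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    // Prefixes each token with '+' (required) and appends '*' (prefix match).
    // Input without any letters or digits yields an empty string.
    static String build(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens(userInput)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(token).append('*');
        }
        return sb.toString();
    }
}
```

Calling RequiredPrefixQueryBuilder.build("Foo ba") gives "+foo* +ba*", which can then be passed to the query parser. Whether every token or only the last one should get the trailing wildcard depends on how strict the search is meant to be.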
In the case of herp foo vs foo herp, phrase slop will probably, I think, get you where you need to be. Swapping the order of terms will add two to the distance, so:
"foo herp"~2 : matches "Foo-Bar-Herp"
"herp foo"~2 : does not
Phrase queries do not support wildcards though, so if you need to combine this with prefix terms, you'll run into problems.
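The slop arithmetic behind those two results can be worked through without Lucene. This sketch reflects my understanding of sloppy phrase matching for the simple case where each query term occurs exactly once in the document: each term's shift is its document position minus its position in the phrase, and the slop needed is the spread between the largest and smallest shift. The class and method names are hypothetical, and the code does not call Lucene.

```java
import java.util.List;

// Hypothetical illustration of the slop arithmetic; it does not call Lucene.
class SlopSketch {

    // Smallest slop at which the phrase would match, assuming every query
    // term occurs exactly once in the document. Returns -1 if a term is absent.
    static int requiredSlop(List<String> phrase, List<String> docTokens) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < phrase.size(); queryPos++) {
            int docPos = docTokens.indexOf(phrase.get(queryPos));
            if (docPos < 0) {
                return -1;
            }
            // How far this term sits from where an exact phrase match would put it.
            int shift = docPos - queryPos;
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }
}
```

With the document tokens foo, bar, herp, the phrase foo herp needs a slop of 1 (so ~2 is enough), while herp foo needs 3 (so ~2 is not). For two adjacent terms in swapped order the result is 2, which is the "add two to the distance" mentioned above.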
If you want to allow more slop than that without the order being changed, then I believe you are moving outside the ability of the QueryParser to express your query, and will need to go to the SpanQuery API to construct your queries manually.
Constructing queries manually, you could do something like:
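// Exact match on the first term ("content" is the field being searched).
SpanQuery term1 = new SpanTermQuery(new Term("content", "foo"));
// Prefix match on the second term, wrapped so it can take part in a span query.
SpanQuery term2Prefix = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("content", "her")));
// Both clauses in order (inOrder = true), with a slop of 5.
SpanQuery finalQuery = new SpanNearQuery(new SpanQuery[] {term1, term2Prefix}, 5, true);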
This looks for the first term (exact match) and a prefix of the second term, in order, with no more than five terms between them.
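Since the question mentions only a few thousand strings and says Lucene is optional, the same "every token, as a prefix, in order, ignoring case" rule can also be checked with a plain in-memory scan. This is a swapped-in alternative and not the span query above: the class and method names are hypothetical, there is no limit on the distance between matched tokens, and splitting on non-alphanumeric characters is an assumption about how the strings should be tokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory matcher; it does not use Lucene.
class InOrderPrefixMatcher {

    // Lowercases the text and splits it on anything that is not a letter
    // or digit (an assumed tokenization), dropping empty pieces.
    static List<String> tokenize(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    // True if every query token is a prefix of some document token and the
    // matched document tokens appear in the same order as the query tokens.
    // A query without tokens matches every document.
    static boolean matches(String query, String document) {
        List<String> docTokens = tokenize(document);
        int next = 0; // index of the first document token not yet consumed
        for (String queryToken : tokenize(query)) {
            // Taking the earliest possible match leaves the most tokens
            // available for the remaining query tokens.
            while (next < docTokens.size() && !docTokens.get(next).startsWith(queryToken)) {
                next++;
            }
            if (next == docTokens.size()) {
                return false;
            }
            next++;
        }
        return true;
    }
}
```

With this sketch, matches("foo her", "Foo-Bar-Herp") is true, while matches("herp foo", "Foo-Bar-Herp") and matches("foo baz", "Foo-Bar-Herp") are both false. A linear scan over a few thousand strings per query is usually cheap, but an index such as Lucene becomes the better fit as the collection grows.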
Upvotes: 1