My problem is how to parse wildcard queries with Lucene so that the query term is passed through a TokenFilter.
I'm using a custom Analyzer with several filters (e.g. ASCIIFoldingFilter, but that's only an example). Whenever Lucene's QueryParser detects that one of the sub-queries is a WildcardQuery, it ignores the Analyzer by design [1].
This means that a query for über is filtered correctly,
über -> uber
but a query for über* (with a wildcard) is not passed through a filter at all:
über* -> über*
Obviously this means that a wildcard query containing ü can never match, because all tokens are filtered on the index side.
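The mismatch can be reproduced without Lucene. The following is only a minimal sketch: fold() is a crude stand-in for ASCIIFoldingFilter built on java.text.Normalizer (it is not Lucene's implementation), and the class name, the wildcard-to-regex translation and the in-memory list of "indexed" terms are all made up for illustration.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical demo class; nothing here is part of Lucene.
final class WildcardFoldingDemo
{
    // Assumption: approximates ASCII folding by decomposing the string (NFD),
    // dropping combining marks and lower-casing. '*' and '?' are left untouched.
    static String fold (String s)
    {
        String decomposed = Normalizer.normalize (s, Normalizer.Form.NFD);
        return decomposed.replaceAll ("\\p{M}+", "").toLowerCase (Locale.ROOT);
    }

    // Translates a wildcard term ('*' = any run of characters, '?' = one character) into a regex.
    static Pattern toRegex (String wildcard)
    {
        StringBuilder sb = new StringBuilder ();
        for (char c : wildcard.toCharArray ())
        {
            if (c == '*') sb.append (".*");
            else if (c == '?') sb.append ('.');
            else sb.append (Pattern.quote (String.valueOf (c)));
        }
        return Pattern.compile (sb.toString ());
    }

    // True if the wildcard term matches at least one of the indexed terms.
    static boolean matchesAny (String wildcard, List<String> indexedTerms)
    {
        Pattern p = toRegex (wildcard);
        for (String t : indexedTerms)
        {
            if (p.matcher (t).matches ()) return true;
        }
        return false;
    }

    public static void main (String[] args)
    {
        // index side: every token went through the filter ("\u00fc" is u-umlaut)
        List<String> indexed = Arrays.asList (fold ("\u00fcber"), fold ("\u00dcbermensch"));

        // query side, unfiltered wildcard term: no indexed term contains u-umlaut
        System.out.println (matchesAny ("\u00fcber*", indexed));        // false
        // query side, wildcard term passed through the same filter
        System.out.println (matchesAny (fold ("\u00fcber*"), indexed)); // true
    }
}
```

The second call matches only because the query term went through the same folding as the indexed terms, which is exactly what the parser skips for wildcard queries.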
Q: How do I force Lucene to filter the query for wildcard queries, too? I'm looking for a way that would at least marginally re-use Lucene's codebase ;-)
Note: As an input I receive a query string, so building queries programmatically is not an option.
Note: I'm using Lucene 4.5.1.
[1] http://www.gossamer-threads.com/lists/lucene/java-user/14224
Context:
// analyzer applies filters in Analyzer#createComponents (String, Reader)
Analyzer analyzer = new CustomAnalyzer (Version.LUCENE_45);
// I'm using org.apache.lucene.queryparser.classic.MultiFieldQueryParser
QueryParser parser = new MultiFieldQueryParser (Version.LUCENE_45, fields, analyzer);
parser.setAllowLeadingWildcard (true);
parser.setMultiTermRewriteMethod (MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
// actual parsing of the input query
Query query = parser.parse (input);
Upvotes: 1
Ok, I found a solution: I'm extending QueryParser to override #getWildcardQuery (String, String). This way I can intercept and alter the term after a wildcard query is detected and before it is created:
@Override
protected Query getWildcardQuery (String field, String termStr) throws ParseException
{
    String term = termStr;
    TokenStream stream = null;
    try
    {
        // we want only a single token and we don't want to lose special characters
        // (KeywordTokenizer emits the whole input, including '*' and '?', as one token)
        stream = new KeywordTokenizer (new StringReader (term));
        stream = new LowerCaseFilter (Version.LUCENE_45, stream);
        stream = new ASCIIFoldingFilter (stream);
        CharTermAttribute charTermAttribute = stream.addAttribute (CharTermAttribute.class);
        stream.reset ();
        while (stream.incrementToken ())
        {
            term = charTermAttribute.toString ();
        }
        // the TokenStream contract expects end() after the last incrementToken()
        stream.end ();
    }
    catch (IOException e)
    {
        // best effort: continue with the term as it stands
        LOGGER.debug ("Failed to filter search query token {}", term, e);
    }
    finally
    {
        IOUtils.closeQuietly (stream);
    }
    // let the parent parser build the WildcardQuery from the filtered term
    return super.getWildcardQuery (field, term);
}
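The KeywordTokenizer above is what keeps * and ? intact. If the filtering step cannot be trusted to leave those characters alone (for example, a tokenizer that drops punctuation), one possible variation is to filter only the literal chunks between the wildcards and stitch the term back together. The sketch below uses only the JDK; the class name, the method name and the UnaryOperator-based filter hook are hypothetical, and escaped wildcards such as \* are deliberately not handled.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical helper; not part of Lucene.
final class WildcardTermNormalizer
{
    // Applies literalFilter to every run of non-wildcard characters and copies
    // '*' and '?' through unchanged. Assumption: wildcards are never escaped.
    static String normalize (String term, UnaryOperator<String> literalFilter)
    {
        StringBuilder out = new StringBuilder ();
        StringBuilder literal = new StringBuilder ();
        for (int i = 0; i < term.length (); i++)
        {
            char c = term.charAt (i);
            if (c == '*' || c == '?')
            {
                // flush the literal chunk collected so far, then keep the wildcard as-is
                out.append (literalFilter.apply (literal.toString ()));
                literal.setLength (0);
                out.append (c);
            }
            else
            {
                literal.append (c);
            }
        }
        // flush the trailing literal chunk (may be empty)
        out.append (literalFilter.apply (literal.toString ()));
        return out.toString ();
    }

    public static void main (String[] args)
    {
        // stand-in filter: lower-casing only
        UnaryOperator<String> lower = s -> s.toLowerCase (Locale.ROOT);
        System.out.println (normalize ("Ub?R*", lower)); // ub?r*
    }
}
```

Inside getWildcardQuery, the literalFilter could wrap whatever token-stream logic is already in place; that wiring is left out here because it depends on the actual filter chain.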
This solution is based on similar questions:
Using a Combination of Wildcards and Stemming
How to get a Token from a Lucene TokenStream?
Note: in my code it's actually a bit more convoluted, in order to keep all filters in a single location...
I still feel that there should be a better solution, though.
Upvotes: 2