Jan Chimiak

Reputation: 746

Lucene query parser to use filters for wildcard queries

My problem is how to parse wildcard queries with Lucene so that the query term is passed through a TokenFilter.

I'm using a custom Analyzer with several filters (e.g. ASCIIFoldingFilter, but that's only an example). The issue is that whenever Lucene's QueryParser detects that one of the sub-queries is a WildcardQuery, it ignores the Analyzer by design [1].

This means that a query for über is filtered correctly,

über -> uber

but a query for über* (with a wildcard) is not passed through a filter at all:

über* -> über*

Obviously this means that, since all tokens are filtered on the index side, there can be no matches for any query containing ü...

Q: How do I force Lucene to filter the query term for wildcard queries, too? I'm looking for a way that at least marginally re-uses Lucene's codebase ;-)

Note: As an input I receive a query string, so building queries programmatically is not an option.

Note: I'm using Lucene 4.5.1.

[1] http://www.gossamer-threads.com/lists/lucene/java-user/14224

Context:

// analyzer applies filters in Analyzer#createComponents (String, Reader)
Analyzer analyzer = new CustomAnalyzer (Version.LUCENE_45); 

// I'm using org.apache.lucene.queryparser.classic.MultiFieldQueryParser
QueryParser parser = new MultiFieldQueryParser (Version.LUCENE_45, fields, analyzer);
parser.setAllowLeadingWildcard (true);
parser.setMultiTermRewriteMethod (MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

// actual parsing of the input query
Query query = parser.parse (input);
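
For illustration, this is roughly what the parsed queries look like with this setup (a sketch continuing the context above; the single-field array { "content" } and the exact toString output are assumptions):

// assuming fields = new String[] { "content" }
Query analyzed  = parser.parse ("über");   // goes through the Analyzer
Query unfolded  = parser.parse ("über*");  // wildcard branch skips the Analyzer

System.out.println (analyzed);   // prints something like: content:uber
System.out.println (unfolded);   // prints something like: content:über* (not folded)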

Upvotes: 1

Views: 3736

Answers (1)

Jan Chimiak

Reputation: 746

OK, I found a solution: I extend QueryParser and override #getWildcardQuery (String, String). This way I can intercept and alter the term after a wildcard query is detected and before the query is created:

@Override
protected Query getWildcardQuery (String field, String termStr) throws ParseException
{
    String term = termStr;
    TokenStream stream = null;
    try
    {
        // we want only a single token and we don't want to lose special characters
        stream = new KeywordTokenizer (new StringReader (term));

        stream = new LowerCaseFilter (Version.LUCENE_45, stream);
        stream = new ASCIIFoldingFilter (stream);

        CharTermAttribute charTermAttribute = stream.addAttribute (CharTermAttribute.class);

        stream.reset ();
        // KeywordTokenizer emits a single token, so this loop runs at most once
        while (stream.incrementToken ())
        {
            term = charTermAttribute.toString ();
        }
        // the TokenStream contract requires end () after the last incrementToken ()
        stream.end ();
    }
    catch (IOException e)
    {
        LOGGER.debug ("Failed to filter search query token {}", term, e);
    }
    finally
    {
        IOUtils.closeQuietly (stream);
    }
    return super.getWildcardQuery (field, term);
}
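
For completeness, here is a sketch of how the override plugs into the setup from the question (the subclass name FilteringMultiFieldQueryParser is just a name I made up for illustration):

public class FilteringMultiFieldQueryParser extends MultiFieldQueryParser
{
    public FilteringMultiFieldQueryParser (Version version, String[] fields, Analyzer analyzer)
    {
        super (version, fields, analyzer);
    }

    // the getWildcardQuery override shown above goes here
}

// used exactly like the plain MultiFieldQueryParser from the question:
QueryParser parser = new FilteringMultiFieldQueryParser (Version.LUCENE_45, fields, analyzer);
parser.setAllowLeadingWildcard (true);
Query query = parser.parse (input);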

This solution is based on similar questions:

Using a Combination of Wildcards and Stemming

How to get a Token from a Lucene TokenStream?

Note: in my code it's actually a bit more convoluted, in order to keep all filters in a single location...
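
One way this could look is a small shared helper that both the Analyzer and the parser override call, so the filter list only exists once (a sketch; buildFilterChain is a hypothetical name, not from my actual code):

// hypothetical helper, called both from CustomAnalyzer#createComponents
// and from the getWildcardQuery override above
static TokenStream buildFilterChain (TokenStream stream)
{
    stream = new LowerCaseFilter (Version.LUCENE_45, stream);
    stream = new ASCIIFoldingFilter (stream);
    return stream;
}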

I still feel that there should be a better solution, though.

Upvotes: 2
