Rocky Singh

Reputation: 15440

Which Lucene SearchAnalyzer should be used for special character search?

I am using Lucene.NET's StandardAnalyzer for search in my ASP.NET project, but the search returns no results for keywords like "C#" or ".NET". If I type "C" or "NET" instead (removing the . and #), it works. On Stack Overflow (which also uses Lucene), I noticed that when I type ".NET" it is changed to "[.NET]" while searching. I have found links saying that StandardAnalyzer cannot handle special character searches, and WhitespaceAnalyzer will not work for us because it does not give the expected results. Can anyone explain how SO manages its search?

Upvotes: 2

Views: 993

Answers (1)

femtoRgon

Reputation: 33351

I'll characterize what SO is doing a bit more closely here:

While I'm not really privy to the implementation details of Stack Overflow, you'll note the same behavior when searching for "java" or "hibernate", even though those terms pose no problem for StandardAnalyzer: they are transformed into "[java]" and "[hibernate]". That just denotes a tag search. It doesn't happen when searching for "lucene" or "junit", so it probably has to do with the popularity of the tags. I would definitely suspect that tag titles are indexed in an un-analyzed form.

For an interesting example, try out "j++". This dead-end Java implementation has a mere 8 questions using the tag on SO, so it won't trigger the automatic tag search. Search "[j++]" and you'll see those 8. Search "j++" and you'll have a rough time finding anything relevant to that particular language, because the punctuation is stripped and you are effectively searching for the bare term.

Onward, to fixing your problem:

Yes, StandardAnalyzer will (speaking imprecisely; see UAX-29 for the precise rules) strip out all of your punctuation. The typical approach is to apply the same analyzer when querying: if StandardAnalyzer analyzes your queries as well as your indexed documents, the searched terms will match. The two query terms mentioned above will be reduced to net and c, and you should get results.
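As a minimal sketch of that idea (in Java against Lucene 4.7, to match the code further down; the field name "body", the in-memory directory, and the class name are placeholders, and the Lucene.NET API is closely analogous):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SameAnalyzerBothWays {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
        Directory dir = new RAMDirectory();

        // Index time: ".NET" is analyzed down to the term "net".
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_47, analyzer));
        Document doc = new Document();
        doc.add(new TextField("body", "Tips for .NET developers", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Query time: the same analyzer reduces the query ".NET" to "net",
        // so the query term lines up with the indexed term.
        QueryParser parser = new QueryParser(Version.LUCENE_47, "body", analyzer);
        Query query = parser.parse(".NET");
        System.out.println(query); // prints body:net
    }
}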

But now you've hit upon perhaps the classic example of a problem with StandardAnalyzer: c, c++, and c# will all be represented identically in the index, so there is no way to search for one without matching the other two!
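You can see the collision directly by dumping the token stream. This sketch (same Lucene 4.7 assumption, hypothetical class name) prints the term c three times:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CollisionDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
        TokenStream ts = analyzer.tokenStream("body", "c c++ c#");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term); // prints "c" three times
        }
        ts.end();
        ts.close();
    }
}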

There are a few ways to deal with this, to my mind:

  1. Throw the baby out with the bathwater: Use WhitespaceAnalyzer or some such, and lose all the nice, fancy things StandardAnalyzer does to help you out.

  2. Just handle those few little edge cases: Okay, so Lucene doesn't like punctuation, and you have a few known terms that run afoul of that. Luckily, you have String.Replace. Replace them with something a little more Lucene-friendly, like "c", "cplusplus" and "csharp", and make sure the replacement happens at both query and index time (a rough sketch follows this list). The problem: since you are doing this outside of the analyzer, the transformation will affect the stored version of the field as well, forcing you to reverse it before you display results to the user.

  3. Do the same as #2, but just a bit fancier: So #2 might work all right, but you've already got analyzers handling the transformation of data for Lucene's consumption, and they affect only the indexed version of a field, not the stored one. Why not use them? Analyzer has a hook, initReader, in which you can slap a CharFilter onto the front of the analyzer stack (see the example way down at the bottom of the Analysis package documentation). The text run through the analyzer will be transformed by the CharFilter before the StandardTokenizer (which is what gets rid of the punctuation, among other things) gets its hands on it. MappingCharFilter, for instance, is a good fit here.
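First, the rough sketch of #2 promised above. The helper name makeLuceneFriendly is made up for illustration, and real code would need to treat case variants and word boundaries more carefully (Java's String.replace below stands in for String.Replace in C#):

// Hypothetical helper: run it over document text before indexing
// and over the raw query string before searching, so both sides agree.
public class PunctuationRewriter {
    public static String makeLuceneFriendly(String text) {
        // Case variants and word boundaries are deliberately glossed over here.
        return text.replace("C++", "cplusplus")
                   .replace("c++", "cplusplus")
                   .replace("C#", "csharp")
                   .replace("c#", "csharp");
    }
}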

Back to #3: you can't subclass StandardAnalyzer, though, the thinking being that you should implement Analyzer rather than subclass implementations of it (see the discussion here, if you're interested in a more complete account of the thought process). So, assuming we want to make sure we get absolutely all the functionality of StandardAnalyzer in the deal, just copy-paste its source code and add an override of the initReader method:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public class ExtraFancyStandardAnalyzer extends StopwordAnalyzerBase {

    public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

    private int maxTokenLength = DEFAULT_MAX_TOKEN_LENGTH;

    public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

    public ExtraFancyStandardAnalyzer(Version matchVersion,
            CharArraySet stopWords) {
        super(matchVersion, stopWords);
        buildMap();
    }

    public ExtraFancyStandardAnalyzer(Version matchVersion) {
        this(matchVersion, STOP_WORDS_SET);
    }

    public ExtraFancyStandardAnalyzer(Version matchVersion, Reader stopwords)
            throws IOException {
        this(matchVersion, loadStopwordSet(stopwords, matchVersion));
    }

    public void setMaxTokenLength(int length) {
        maxTokenLength = length;
    }

    public int getMaxTokenLength() {
        return maxTokenLength;
    }


    // The following two methods, and a call to buildMap() in the ctor
    // are the only things changed from StandardAnalyzer

    private NormalizeCharMap map;

    public void buildMap() {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("c++", "cplusplus");
        builder.add("c#", "csharp");
        map = builder.build();
    }

    // Wrap the incoming reader so the mapping is applied before the
    // tokenizer sees the text (i.e., before punctuation is stripped).
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(map, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName,
            final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion,
                reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(ExtraFancyStandardAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }
}

Note: This is written and tested in Java, Lucene version 4.7. The C# implementation shouldn't be too much different. Copy the StandardAnalyzer, build a MappingCharFilter (which is actually just a hair simpler to deal with in version 3.0.3), and wrap the reader with it in an override of the initReader method.
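As a quick sanity check (same Java / Lucene 4.7 assumptions, hypothetical demo class name), a token dump mirroring the earlier one shows the mapped terms coming out the other side. Note that MappingCharFilter runs before LowerCaseFilter, so the map as written only catches lowercase "c++" and "c#":

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ExtraFancyDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new ExtraFancyStandardAnalyzer(Version.LUCENE_47);
        TokenStream ts = analyzer.tokenStream("body", "we use c, c++ and c#");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // "and" is dropped by the StopFilter;
            // prints: we use c cplusplus csharp
            System.out.print(term + " ");
        }
        ts.end();
        ts.close();
    }
}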

Upvotes: 2
