Datbates

Reputation: 325

TokenStream contract violation when using custom Analyzer with Lucene 4.9

I have a few custom Analyzers like this one:

private static class ModelAnalyzer extends Analyzer
{
    @Override
    protected TokenStreamComponents createComponents(String string, Reader reader)
    {
        return new TokenStreamComponents(
            new StandardTokenizer(Version.LUCENE_4_9, reader),
            new LowerCaseFilter(Version.LUCENE_4_9,
                new NGramTokenFilter(Version.LUCENE_4_9,
                    new CharTokenizer(Version.LUCENE_4_9, reader)
                    {
                        @Override
                        protected boolean isTokenChar(int c)
                        {
                            return Character.isLetterOrDigit(c);
                        }
                    }, 3, 20)));
    }
}

They are added to a PerFieldAnalyzerWrapper, which is set on my IndexWriterConfig. When I try to rebuild my index, I always get this error when adding the second document:

java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

All I am doing is adding documents to my IndexWriter. I am not touching these filters or tokenizers in any way, so there is no clean way for me to call reset() on them. Shouldn't the IndexWriter follow the "correct consuming workflow" without my help?
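For reference, this is my understanding of the consuming workflow the TokenStream Javadocs describe (a minimal sketch; the field name and text are placeholders, since in my case IndexWriter performs these steps internally):

TokenStream stream = analyzer.tokenStream("field", new StringReader("some text"));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
try
{
    stream.reset();                  // required before the first incrementToken()
    while (stream.incrementToken())
    {
        System.out.println(term.toString());
    }
    stream.end();                    // required after the last incrementToken()
}
finally
{
    stream.close();                  // releases the Reader
}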

After 8 hours of reading everything on the web about this, I gave up and just passed Version.LUCENE_4_5 to each of my tokenizers and filters so that the irritating state machine checks (which I understand were added in 4.6) are not run. This fixed the problem, but I am at a loss as to the right way to make this work with 4.9. I have to assume I am building my Analyzers wrong, but I can't see how I could do it differently, and it works just fine in the earlier version.

Upvotes: 1

Views: 1664

Answers (2)

Datbates

Reputation: 325

Javi pointed me in the right direction by suggesting that my reader might be being used twice. I went back to my analyzer and rewrote it from scratch, taking advantage of the existing pre-written components, and it now works perfectly. Basically, the key is to keep it simple and not try to port the old chain over directly.

private static class ModelAnalyzer extends Analyzer
{
    @Override
    protected TokenStreamComponents createComponents(String string, Reader reader)
    {
        // a single Tokenizer is the only component that consumes the Reader
        Tokenizer tokenizer = new NGramTokenizer(Version.LUCENE_4_9, reader, 3, 20)
        {
            @Override
            protected boolean isTokenChar(int c)
            {
                return Character.isLetterOrDigit(c);
            }
        };
        // the filter wraps the tokenizer's stream, not the Reader
        return new TokenStreamComponents(tokenizer,
            new LowerCaseFilter(Version.LUCENE_4_9, tokenizer));
    }
}
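For completeness, this is roughly how the analyzer gets wired in, as described in the question (a sketch; the "model" field name and the index path are placeholders):

Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
fieldAnalyzers.put("model", new ModelAnalyzer()); // "model" is a placeholder field name
Analyzer wrapper = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(Version.LUCENE_4_9), fieldAnalyzers);

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9, wrapper);
IndexWriter writer = new IndexWriter(
    FSDirectory.open(new File("index")), config); // placeholder index directory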

Upvotes: 0

jeojavi

Reputation: 886

Maybe the problem is that you are passing the reader twice: both the StandardTokenizer and the anonymous CharTokenizer consume the same Reader. Only the Tokenizer should take the Reader; each TokenFilter should wrap the previous TokenStream. It should work this way:

Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_4_9, reader); // only the tokenizer takes the reader
TokenFilter filters = new LowerCaseFilter(Version.LUCENE_4_9, tokenizer);
filters = new NGramTokenFilter(Version.LUCENE_4_9, filters); // each filter wraps the previous stream
filters = ...
return new TokenStreamComponents(tokenizer, filters);

Upvotes: 2
