StepTNT

Reputation: 3967

"TokenStream contract violation: close() call missing" when calling addDocument

I'm using Lucene's features to build a simple way to match similar words within a text.

My idea is to have an Analyzer run on my text to produce a TokenStream, and for each token run a FuzzyQuery to see if there's a match in my index. If not, I index a new Document containing just that unique word.

Here's what I'm getting, though:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:411)
    at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:111)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:165)
    at org.apache.lucene.document.Field.tokenStream(Field.java:568)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:708)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1562)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
    at org.myPackage.MyClass.addToIndex(MyClass.java:58)

Relevant code here:

// Setup tokenStream based on StandardAnalyzer
TokenStream tokenStream = analyzer.tokenStream(TEXT_FIELD_NAME, new StringReader(input));
tokenStream = new StopFilter(tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter(tokenStream, 3);
tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
...
// Iterate and process each token from the stream
while (tokenStream.incrementToken()) {
    CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
    processWord(charTerm.toString());
}
...
// Processing a word means looking for a similar one inside the index and, if not found, adding this one to the index
void processWord(String word) {
    ...
    if (DirectoryReader.indexExists(index)) {
        reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs searchResults = searcher.search(query, 1);
        if (searchResults.totalHits > 0) {
            Document foundDocument = searcher.doc(searchResults.scoreDocs[0].doc);
            super.processWord(foundDocument.get(TEXT_FIELD_NAME));
        } else {
            addToIndex(word);
        }
    } else {
        addToIndex(word);
    }
    ...
}
...
// Create a new Document to index the provided word
void addWordToIndex(String word) throws IOException {
    Document newDocument = new Document();
    newDocument.add(new TextField(TEXT_FIELD_NAME, new StringReader(word)));
    indexWriter.addDocument(newDocument);
    indexWriter.commit();
}

The exception seems to say that I should close the TokenStream before adding things to the index, but that doesn't make sense to me: how are the index and the TokenStream related? The index just receives a Document containing a String; the fact that the String came from a TokenStream should be irrelevant.

Any hint on how to solve this?

Upvotes: 1

Views: 2576

Answers (1)

femtoRgon

Reputation: 33341

The problem is that you are reusing the same analyzer that the IndexWriter is using. You have a TokenStream open from that analyzer, and then you try to index a document. That document needs to be analyzed, but the analyzer finds its old TokenStream is still open and throws an exception.

To fix it, create a new, separate Analyzer instance for processing and testing the string, instead of reusing the one the IndexWriter uses.
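A minimal sketch of that fix (class and field names here are illustrative, not from the question): one analyzer is reserved for the IndexWriter, a second one is used for walking the input text, and try-with-resources guarantees the TokenStream contract of reset() / incrementToken() / end() / close() is honored.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SeparateAnalyzerExample {
    // Analyzer dedicated to the IndexWriter -- never used for manual token streams.
    private final Analyzer writerAnalyzer = new StandardAnalyzer();
    // A second, independent analyzer just for tokenizing the input ourselves.
    private final Analyzer textAnalyzer = new StandardAnalyzer();

    final List<String> seen = new ArrayList<>();

    void processInput(String input) throws IOException {
        // try-with-resources ensures close() is always called, which is
        // exactly what the "contract violation: close() call missing" asks for.
        try (TokenStream ts = textAnalyzer.tokenStream("text", input)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Indexing inside this loop is now safe: it goes through
                // writerAnalyzer, whose streams are not held open here.
                processWord(term.toString());
            }
            ts.end();
        }
    }

    void processWord(String word) {
        seen.add(word); // stand-in for the search / addToIndex logic above
    }
}
```

Because the IndexWriter's analyzer is never touched by the manual tokenization, addDocument can re-analyze the new Document without hitting the still-open stream.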

Upvotes: 5
