Aechlys

Reputation: 1306

Java | Lucene | TokenStream fields cannot be stored

In my application I receive text to which I apply filters, and I would like to store the filtered result in a Lucene Document object. I do not care about the original text.

String stringToProcess = "...";
TokenStream stream = analyzer.tokenStream(null, new StringReader(stringToProcess));
TokenStream procStream = new CustomFilter(stream, opts);

Document luceneDocument = new Document();
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setOmitNorms(false);
ft.setStoreTermVectors(true);
luceneDocument.add(new Field("content", procStream, ft));

This throws:

Exception in thread "main" java.lang.IllegalArgumentException: TokenStream fields cannot be stored

If I change TextField.TYPE_STORED to TYPE_NOT_STORED, there's no exception. However, the stored content of the field is then null. There is a constructor for Field which clearly accepts a TokenStream object.

I can manually extract the tokens from procStream with .incrementToken() and .getAttribute(CharTermAttribute.class).
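Roughly, that manual extraction looks like this (just a sketch that concatenates the term text of each token into a string):

CharTermAttribute termAtt = procStream.getAttribute(CharTermAttribute.class);
StringBuilder sb = new StringBuilder();
procStream.reset();                          // must be called before incrementToken()
while (procStream.incrementToken()) {
    sb.append(termAtt.toString()).append(' ');
}
procStream.end();
procStream.close();
String filteredText = sb.toString().trim();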

My question: How can I pass the TokenStream to the Field object?

Upvotes: 2

Views: 496

Answers (1)

femtoRgon

Reputation: 33341

You can't just pass in a TokenStream and store the field.

A TokenStream is a stream of analyzed, indexable tokens. The stored content of a field is the pre-analysis string. You are not providing that string to the field, so it doesn't have anything suitable to be stored, thus the exception.

Instead, it would be more typical to set the Analyzer in the IndexWriterConfig and let it handle analyzing the field for you. I'm guessing the reason you are doing it this way, instead of letting the IndexWriter handle it, is that you want to add that CustomFilter to an out-of-the-box analyzer. In that case, just create your own custom Analyzer. Analyzers are easy: copy the source of the analyzer you want to use, and add your custom filter to the chain in createComponents. Say you're using StandardAnalyzer; then you'd change the createComponents method you copied to look like this:

@Override
protected TokenStreamComponents createComponents(final String fieldName) {
  final StandardTokenizer src = new StandardTokenizer();
  src.setMaxTokenLength(maxTokenLength);
  TokenStream tok = new StandardFilter(src);
  tok = new LowerCaseFilter(tok);
  tok = new StopFilter(tok, stopwords);
  tok = new CustomFilter(tok, opts); // just adding this line
  return new TokenStreamComponents(src, tok) {
    @Override
    protected void setReader(final Reader reader) {
      // if this lives in your own analyzer class, reference that class here instead of StandardAnalyzer
      src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
      super.setReader(reader);
    }
  };
}

Then you can create your field like:

new Field("content", stringToProcess, ft);
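With that in place, indexing would look roughly like this (a sketch: MyCustomAnalyzer stands in for whatever you named your copied analyzer, and the index path is just an example):

Analyzer analyzer = new MyCustomAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), iwc)) {
    Document doc = new Document();
    // plain string field; your custom analysis chain runs at indexing time
    doc.add(new Field("content", stringToProcess, ft));
    writer.addDocument(doc);
}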

Okay, so I've assumed this is a bit of an XY problem. With the caveat that creating a custom analyzer is very likely the better solution, you actually can pass a TokenStream to the Field and store it as well; you just need to provide the string to store in addition to the TokenStream. That would look something like this:

Field myField = new Field("content", stringToProcess, ft);
myField.setTokenStream(procStream);
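The string passed to the constructor is what gets stored, and the stream you set is what gets indexed, so the two are independent. If you want the stored value to be the filtered text rather than the original, build that string yourself (for instance with the manual extraction loop from your question) and pass it in instead of stringToProcess. Then add the field as usual (writer being an IndexWriter opened as in the sketch above):

Document doc = new Document();
doc.add(myField);
writer.addDocument(doc);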

Upvotes: 2
