Reputation: 1306
In my application I receive a text to which I apply filters, and I would like to store the filtered result in a Lucene Document object. I do not care about the original text.
String stringToProcess = "...";
TokenStream stream = analyzer.tokenStream(null, new StringReader(stringToProcess));
TokenStream procStream = new CustomFilter(stream, opts);
Document luceneDocument = new Document();
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setOmitNorms(false);
ft.setStoreTermVectors(true);
luceneDocument.add(new Field("content", procStream, ft));
This throws:
Exception in thread "main" java.lang.IllegalArgumentException: TokenStream fields cannot be stored
If I change TextField.TYPE_STORED to TYPE_NOT_STORED, there's no exception. However, the content of the field is null. There's a constructor for Field which clearly accepts a TokenStream object.
I can manually extract the tokens from procStream with .incrementToken() and .getAttribute(CharTermAttribute.class).
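For reference, the manual extraction mentioned above can be sketched roughly as follows. This is only an illustration: it assumes Lucene 5 or later on the classpath, and the class name TokenDump and the method name tokens are made up for the example. Any Analyzer (or a filtered stream such as procStream) can be consumed the same way, as long as reset() is called before the first incrementToken() and end() after the last one.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, named only for this example.
public class TokenDump {

    // Runs the analyzer over the text and collects each emitted term as a String.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        // try-with-resources closes the stream once it has been consumed
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());    // copy the term; the attribute is reused per token
            }
            ts.end();                        // finalizes end-of-stream state
        }
        return out;
    }
}
```

With a StandardAnalyzer, calling TokenDump.tokens(analyzer, "Hello Lucene World") should yield the lowercased terms hello, lucene and world.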
My question: How can I pass the TokenStream to the Field object?
Upvotes: 2
Views: 496
Reputation: 33341
You can't just pass in a TokenStream and store the field.
A TokenStream is a stream of analyzed, indexable tokens. The stored content of a field is the pre-analysis string. You are not providing that string to the field, so it doesn't have anything suitable to be stored, thus the exception.
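To make the distinction concrete, here is a small sketch, assuming Lucene 5 or later with StandardAnalyzer available on the classpath. The class and method names are invented for the example. It shows that a field built from a string has a string value that can be stored, while a field built only from a TokenStream has none, which is why the field content appears as null in the question.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// Hypothetical demo class, named only for this example.
public class StoredVsIndexed {

    // A field created from a String keeps that String as its storable value.
    static String storedFromString(String text) {
        Field f = new Field("content", text, TextField.TYPE_STORED);
        return f.stringValue();
    }

    // A field created from a TokenStream alone has tokens to index,
    // but no string value that could be stored.
    static String storedFromTokenStream(String text) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("content", text)) {
            // TYPE_STORED here would throw "TokenStream fields cannot be stored"
            Field f = new Field("content", ts, TextField.TYPE_NOT_STORED);
            return f.stringValue();
        }
    }
}
```

Here storedFromString("Some Text") returns the original text, while storedFromTokenStream("Some Text") returns null.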
Instead, it would be more typical to set the Analyzer in the IndexWriterConfig, and let it handle analyzing the field for you. I'm guessing the reason you are doing it this way, instead of letting the IndexWriter handle it, is that you want to add that CustomFilter to an out-of-the-box analyzer. Instead, just create your own custom Analyzer. Analyzers are easy. Just copy the source of the analyzer you want to use, and add your custom filter to the chain in createComponents. Say you're using StandardAnalyzer, then you'd change the createComponents method you copied to look like this:
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        tok = new StopFilter(tok, stopwords);
        tok = new CustomFilter(tok, opts); // Just adding this line
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }
Then you can create your field like:
new Field("content", stringToProcess, ft);
Okay, so I've assumed this is a bit of an XY problem. With the caveat that creating a custom analyzer is very likely the better solution, you actually can pass a TokenStream to the Field and store it as well. You just need to provide the string to store along with the TokenStream. That would look something like this:
Field myField = new Field("content", stringToProcess, ft);
myField.setTokenStream(procStream);
Upvotes: 2