Reputation: 341
I'm currently working on a project using Lucene 8.1.0 (pure Lucene, not Solr). I would like to add highlighting to the results that are displayed for a user query. The problem I'm facing is that I can't figure out how to get a TokenStream for a field of a specific document. The field I'm trying to access is indexed with term vectors, along with other FieldType parameters.
The official documentation for the TokenSources class lists almost all of the methods used in the past as deprecated. I've searched far and wide, but all the guides and posts I've found are relatively old, and all of them use TokenSources with one of the deprecated methods. I'd be willing to use getTermVectorTokenStreamOrNull(), but I don't understand what to pass as the Fields parameter. (I can't instantiate a Fields object, since it's abstract, and none of the direct known subclasses make sense to me.)
My current solution is to get a TokenStream the following way:
String text = hit.get(field.label);
Analyzer analyzer = new ClassicAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(field.label, text);
TextFragment[] fragments = highlighter.getBestTextFragments(tokenStream, text, false, 5);
What I actually need help with is how to use the term vectors of a field to get a TokenStream for setting up the Highlighter.
If you think this is wrong and I should use the getBestFragments(Analyzer analyzer, String fieldName, String text, int maxNumFragments) method instead, then I won't need the term vectors at all. But from what I can gather, using term vectors improves performance at search time at the cost of a larger index. What is your advice?
Thanks in advance!
Upvotes: 1
Views: 173
Reputation: 886
It is explained in the Lucene Javadoc: https://lucene.apache.org/core/8_0_0/highlighter/org/apache/lucene/search/highlight/TokenSources.html#getTermVectorTokenStreamOrNull-java.lang.String-org.apache.lucene.index.Fields-int-
tvFields - from IndexReader.getTermVectors(int). Possibly null.
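So all you need is an IndexReader to get the Fields. The int parameter of getTermVectors is the docID, which you get from your search result as topDocs.scoreDocs[i].doc. (The int parameter of getTermVectorTokenStreamOrNull itself is maxStartOffset; pass -1 for no limit.)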
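To make that concrete, here is a minimal sketch of the whole round trip, not a definitive implementation. It assumes Lucene 8.x with lucene-core and lucene-highlighter (plus the jars the highlighter depends on, such as lucene-memory) on the classpath. The class name, the field name "body", the sample text and the use of StandardAnalyzer are all made up for illustration. The field has to be indexed with term vectors including offsets, otherwise getTermVectorTokenStreamOrNull returns null.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

class TermVectorHighlightSketch {

    // Hypothetical field name, used only for this illustration.
    static final String FIELD = "body";

    // Highlights the best-scoring hit using the stored term vector instead of re-analyzing the text.
    static String highlightTopHit(IndexReader reader, Query query) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.search(query, 1);
        if (topDocs.scoreDocs.length == 0) {
            return null; // no hits
        }
        int docId = topDocs.scoreDocs[0].doc;

        Highlighter highlighter = new Highlighter(new QueryScorer(query, FIELD));

        // All term vectors of the document; may be null if none were stored.
        Fields tvFields = reader.getTermVectors(docId);
        TokenStream ts = TokenSources.getTermVectorTokenStreamOrNull(
                FIELD, tvFields, highlighter.getMaxDocCharsToAnalyze() - 1);
        if (ts == null) {
            return null; // no term vector with offsets for this field
        }

        // The highlighter still needs the original text, so the field must be stored.
        String text = searcher.doc(docId).get(FIELD);
        return highlighter.getBestFragment(ts, text);
    }

    // Builds a one-document in-memory index and highlights it.
    static String demo() throws Exception {
        FieldType type = new FieldType(TextField.TYPE_STORED);
        type.setStoreTermVectors(true);
        type.setStoreTermVectorPositions(true);
        type.setStoreTermVectorOffsets(true); // offsets are required for highlighting
        type.freeze();

        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new Field(FIELD, "Lucene highlights matching terms", type));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            return highlightTopHit(reader, new TermQuery(new Term(FIELD, "lucene")));
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

With the default formatter the matched term comes back wrapped in <B>...</B>. In real code you would loop over topDocs.scoreDocs rather than take only the first hit.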
Upvotes: 0
Reputation: 341
Found the solution to my problem. The thing I was missing was the TokenStreamFromTermVector class. It extends TokenStream, so I'm able to plug it into the getBestFragments() method.
I'm going to leave this up for anyone lost and looking for the same thing. Looking through the USE tabs really helped, but I don't know why TokenStreamFromTermVector isn't linked as a subclass on the TokenStream page. (I know it's in a different package, but still, there's no way to reach the TokenStreamFromTermVector page quickly through the normal workflow.)
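For anyone who wants to see it in code, here is a rough sketch of building the stream directly from a single field's term vector. As far as I can tell the class ships as TokenStreamFromTermVector in org.apache.lucene.search.highlight (lucene-highlighter), so that is the name used below. The class name DirectTermVectorStream, the field name "body" and the sample text are made up for illustration, and StandardAnalyzer is only an assumption about the analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.search.highlight.TokenStreamFromTermVector;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

class DirectTermVectorStream {

    // Returns a TokenStream rebuilt from the field's term vector, or null if it cannot be built.
    static TokenStream fromTermVector(IndexReader reader, int docId, String field) throws Exception {
        Terms vector = reader.getTermVector(docId, field);
        if (vector == null || !vector.hasOffsets()) {
            return null; // the highlighter needs offsets to mark up the text
        }
        return new TokenStreamFromTermVector(vector, -1); // -1: no start-offset limit
    }

    // Indexes one document and reads its terms back from the term vector.
    static List<String> demo() throws Exception {
        FieldType type = new FieldType(TextField.TYPE_STORED);
        type.setStoreTermVectors(true);
        type.setStoreTermVectorPositions(true);
        type.setStoreTermVectorOffsets(true);
        type.freeze();

        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new Field("body", "quick brown fox", type)); // hypothetical field and text
            writer.addDocument(doc);
        }

        List<String> terms = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(dir);
             TokenStream ts = fromTermVector(reader, 0, "body")) { // doc 0: the only document
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

The demo consumes the stream by hand only to show what is in it. For highlighting you would pass the fresh, unconsumed stream from fromTermVector(...) to getBestFragments() together with the stored field text.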
Upvotes: 1