Invalid_Path

Reputation: 341

How to get a TokenStream of a document's field to be used for highlighting?

The Problem

I'm currently working on a project using Lucene 8.1.0 (pure Lucene, not Solr). I would like to add highlighting to the displayed results based on the user query. The problem I'm facing is that I can't figure out how to get a TokenStream for a field of a given document. The field I'm trying to access is indexed with term vectors, along with other FieldType parameters.

What I tried

The official documentation for the TokenSources class lists almost all of the methods used in the past as deprecated. I've searched far and wide; all the guides and posts I've found are relatively old, and all of them use TokenSources with one of the deprecated methods. I'd be willing to use getTermVectorTokenStreamOrNull(), but I don't understand what to pass for the Fields parameter. (I can't instantiate a Fields object since it's abstract, and none of the direct known subclasses make sense to me.)

The current solution I have is to get a TokenStream the following way:

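// Re-analyze the stored field text to build a TokenStream (term vectors are not used here).
String text = hit.get(field.label);
Analyzer analyzer = new ClassicAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(field.label, text);
// Request up to 5 fragments, without merging contiguous ones.
TextFragment[] fragments = highlighter.getBestTextFragments(tokenStream, text, false, 5);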

What I actually need help with is how to use the term vectors of a field to get a TokenStream for setting up the Highlighter.
If you think this is the wrong approach and I should instead use the getBestFragments(Analyzer analyzer, String fieldName, String text, int maxNumFragments) method, then I won't need the term vectors at all. But from what I can gather, using term vectors helps performance at search time, with the trade-off of a larger index. What is your advice?
Thanks in advance!

Upvotes: 1

Views: 173

Answers (2)

Vuk Djapic

Reputation: 886

It is explained in the Lucene Javadoc for the tvFields parameter: https://lucene.apache.org/core/8_0_0/highlighter/org/apache/lucene/search/highlight/TokenSources.html#getTermVectorTokenStreamOrNull-java.lang.String-org.apache.lucene.index.Fields-int-

tvFields - from IndexReader.getTermVectors(int). Possibly null.

So all you need is an IndexReader to get the Fields. The int parameter of getTermVectors(int) is the docID, which you get from your search result as the topDocs.scoreDocs[i].doc value.
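
As a rough sketch of how these pieces could fit together in Lucene 8.x: the class name TermVectorHighlighting, the method name highlight, and the limits of 10 hits and 5 fragments are made up for illustration. It assumes the field is stored and was indexed with term vectors that include offsets, since getTermVectorTokenStreamOrNull() returns null otherwise. Passing -1 as the last argument means no limit on the start offset.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.search.highlight.TokenSources;

// Hypothetical helper class, not part of Lucene.
public final class TermVectorHighlighting {

    /** Returns highlighted fragments of one stored field for the top hits of a query. */
    public static List<String> highlight(IndexSearcher searcher, Query query, String fieldName)
            throws IOException, InvalidTokenOffsetsException {
        IndexReader reader = searcher.getIndexReader();
        Highlighter highlighter = new Highlighter(new QueryScorer(query, fieldName));
        List<String> result = new ArrayList<>();

        TopDocs topDocs = searcher.search(query, 10);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            int docId = scoreDoc.doc;

            // The highlighter still needs the original text, so the field must be stored.
            String text = searcher.doc(docId).get(fieldName);
            if (text == null) {
                continue;
            }

            // Term vectors of every field of this document; may be null.
            Fields tvFields = reader.getTermVectors(docId);

            // Null when the field has no term vectors with offsets.
            TokenStream tokenStream =
                    TokenSources.getTermVectorTokenStreamOrNull(fieldName, tvFields, -1);
            if (tokenStream == null) {
                continue;
            }

            TextFragment[] fragments =
                    highlighter.getBestTextFragments(tokenStream, text, false, 5);
            for (TextFragment fragment : fragments) {
                // Keep only fragments that actually contain a query match.
                if (fragment != null && fragment.getScore() > 0) {
                    result.add(fragment.toString());
                }
            }
        }
        return result;
    }
}
```

When the token stream comes back null, falling back to re-analyzing the stored text with an Analyzer, as in the question, is one option.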

Upvotes: 0

Invalid_Path

Reputation: 341

Found the solution to my problem. The thing I was missing was the TokenStreamFromTermVector class. TokenStreamFromTermVector extends TokenStream, so I'm able to plug it into the getBestFragments() method.
Gonna leave this up for anyone lost and looking for the same thing. Looking through the Use tabs in the Javadoc really helped, but I don't know why TokenStreamFromTermVector isn't linked as a subclass on the TokenStream page.
(I know it's in a different package, but still, there's no way to reach the TokenStreamFromTermVector page quickly through the normal workflow.)
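
For anyone who wants to see it in code, here is a minimal sketch of this approach against Lucene 8.x, where the class lives at org.apache.lucene.search.highlight.TokenStreamFromTermVector. The wrapper class VectorTokenStreamExample, the method bestFragments, and the limit of 5 fragments are names and values chosen for illustration only. It assumes the field was indexed with term vector offsets, because the constructor rejects vectors without them.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenStreamFromTermVector;

// Hypothetical helper class, not part of Lucene.
public final class VectorTokenStreamExample {

    /** Highlights the given stored text of one document using its term vector. */
    public static String[] bestFragments(IndexReader reader, Query query, int docId,
                                         String fieldName, String text)
            throws IOException, InvalidTokenOffsetsException {
        // Term vector of a single field; null if none was indexed for this document.
        Terms vector = reader.getTermVector(docId, fieldName);
        if (vector == null || !vector.hasOffsets()) {
            return new String[0];
        }

        // -1 means no limit on the start offset of returned terms.
        TokenStream tokenStream = new TokenStreamFromTermVector(vector, -1);

        Highlighter highlighter = new Highlighter(new QueryScorer(query, fieldName));
        return highlighter.getBestFragments(tokenStream, text, 5);
    }
}
```

Here docId would be the scoreDocs[i].doc value of a hit, and text the stored value of the same field.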

Upvotes: 1
