Lucene Highlighter class: highlight different words in different colors

Question

Probably most people reading the title who know a bit about Lucene won't need much further explanation. NB I use Jython but I think most Java users will understand the Java equivalent...

It's a classic thing to want to do: you have more than one term in your search string... in Lucene terms this returns a BooleanQuery. Then you use something like this code to highlight (NB I am a Lucene newbie, this is all closely tweaked from Net examples):

yellow_highlight = SimpleHTMLFormatter( '', '' )
green_highlight = SimpleHTMLFormatter( '', '' )

...

stream = FrenchAnalyzer( Version.LUCENE_46 ).tokenStream( "both", StringReader( both ) )
scorer = QueryScorer( fr_query, "both" )
fragmenter = SimpleSpanFragmenter(scorer)
highlighter = Highlighter( yellow_highlight, scorer )
highlighter.setTextFragmenter(fragmenter)
best_fragments = highlighter.getBestTextFragments( stream, both, True, 5 )
if best_fragments:
    for best_frag in best_fragments:
        print "=== best frag: %s, type %s" % ( best_frag, type( best_frag ))
        html_text += "&bull %s

" % unicode( best_frag )

... and then the html_text is put in a JTextPane for example.

But how would you make the first word in your query highlight with a yellow background and the second word highlight with a green background? I have tried to understand the various classes in org.apache.lucene.search... to no avail. So my only way of learning was googling. I couldn't find any clues...

mike rodent · Accepted Answer

I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library. This way is hard work.

But for anyone interested there's a far simpler solution. Taking advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with s and s in it.

A multi-word query generates a BooleanQuery... from which you can extract multiple TermQuerys by going booleanQuery.clauses() ... getQuery()

I'm working in Groovy. The colouring I want to apply is console codes, as per BASH (or Cygwin). Other types of colouring can be worked out on this model.

So you set up a map before to hold your "markup details":

def markupDetails = [:]

Then for each TermQuery, you call this, with the same text param each time, stipulating a different colour param for each term. NB I'm using Lucene 6.

def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
    def termQueryScorer = new QueryScorer( tq )
    def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
    TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
    String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
    // not sure under what circs you get > 1 fragment...
    assert frags.size() <= 1
    // NB you don't always get all terms in all returned LDocuments... 
    if( frags.size() ) {
        String highlightedFrag = frags[ 0 ]
        Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
        def pos = 0
        def previousEnd = 0
        while( boldTagMatcher.find()) {
            pos += boldTagMatcher.start() - previousEnd
            previousEnd =  boldTagMatcher.end()
            markupDetails[  pos ] = boldTagMatcher.group() == ''? colour : ConsoleColors.RESET
        }
    }
}

As I said, I wanted to colourise console output. The colour parameter in the method here is per the console colour codes as found here, for example. E.g. yellow is \033[033m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.

... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You work backwards from the end of the text so as to insert the "markup" at the right position in the String. NB here text is your original unmarked-up String:

markupDetails.sort().reverseEach{ pos, markup -> String firstPart = text.substring( 0, pos ) String secondPart = text.substring( pos ) text = firstPart + markup + secondPart }

... at the end of which text contains your marked-up String: print to console. Lovely.

Lucene Highlighter class: highlight different words in different colors

Answers (1)

Related Questions