Probably most people reading the title who know a bit about Lucene won't need much further explanation. NB I use Jython but I think most Java users will understand the Java equivalent...
It's a classic thing to want to do: you have more than one term in your search string, and in Lucene terms parsing it returns a BooleanQuery. Then you use something like this code to highlight (NB I am a Lucene newbie; this is all closely adapted from examples on the Net):
yellow_highlight = SimpleHTMLFormatter( '<b style="background-color:yellow">', '</b>' )
green_highlight = SimpleHTMLFormatter( '<b style="background-color:green">', '</b>' )
...
stream = FrenchAnalyzer( Version.LUCENE_46 ).tokenStream( "both", StringReader( both ) )
scorer = QueryScorer( fr_query, "both" )
fragmenter = SimpleSpanFragmenter(scorer)
highlighter = Highlighter( yellow_highlight, scorer )
highlighter.setTextFragmenter(fragmenter)
best_fragments = highlighter.getBestTextFragments( stream, both, True, 5 )
if best_fragments:
    for best_frag in best_fragments:
        print "=== best frag: %s, type %s" % ( best_frag, type( best_frag ))
        html_text += "&bull; %s<br>\n" % unicode( best_frag )
... and then the html_text is put in a JTextPane for example.
But how would you make the first word in your query highlight with a yellow background and the second word highlight with a green background? I have tried to understand the various classes in org.apache.lucene.search... to no avail. So my only way of learning was googling. I couldn't find any clues...
I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library. This way is hard work.
But for anyone interested there's a far simpler solution. It takes advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with <B>s and </B>s in it.
A multi-word query generates a BooleanQuery, from which you can extract the individual TermQuery objects by calling booleanQuery.clauses() and then getQuery() on each clause.
I'm working in Groovy. The colouring I want to apply is console codes, as per BASH (or Cygwin). Other types of colouring can be worked out on this model.
So you set up a map beforehand to hold your "markup details":
def markupDetails = [:]
Then for each TermQuery, you call this, with the same text param each time, stipulating a different colour param for each term. NB I'm using Lucene 6.
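import java.util.regex.Matcher
import org.apache.lucene.analysis.TokenStream
import org.apache.lucene.search.TermQuery
import org.apache.lucene.search.highlight.*

// formatter (a default SimpleHTMLFormatter), fieldName, analyser and markupDetails are defined elsewhere
def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
    def termQueryScorer = new QueryScorer( tq )
    def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
    TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
    String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
    // not sure under what circs you get > 1 fragment...
    // (possibly long text with the default fragmenter; a NullFragmenter keeps the text whole)
    assert frags.size() <= 1
    // NB you don't always get all terms in all returned LDocuments...
    if( frags.size() ) {
        String highlightedFrag = frags[ 0 ]
        Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
        def pos = 0          // offset in the text with the tags stripped out
        def previousEnd = 0  // end of the previous tag in the tagged String
        while( boldTagMatcher.find()) {
            // only the plain text between two tags advances pos
            pos += boldTagMatcher.start() - previousEnd
            previousEnd = boldTagMatcher.end()
            markupDetails[ pos ] = boldTagMatcher.group() == '<B>' ? colour : ConsoleColors.RESET
        }
    }
}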
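For readers following along in Python (or Jython), the offset bookkeeping in that method can be sketched with the standard re module alone. This is only an illustration of the idea, not the Lucene call itself: the sample string is made up to look like SimpleHTMLFormatter output, and collect_markup is a hypothetical name.

```python
import re

BOLD_TAG = re.compile(r"</?B>")

def collect_markup(highlighted, colour, reset="\033[0m", markup=None):
    """Map offsets in the un-tagged text to the code to insert there."""
    markup = {} if markup is None else markup
    pos = 0            # offset in the text with the tags stripped out
    previous_end = 0   # end of the previous tag in the tagged string
    for match in BOLD_TAG.finditer(highlighted):
        # Only the plain text between two tags advances pos.
        pos += match.start() - previous_end
        previous_end = match.end()
        markup[pos] = colour if match.group() == "<B>" else reset
    return markup

# Made-up example of what the highlighter might return for one term.
sample = "the <B>quick</B> brown <B>fox</B>"
offsets = collect_markup(sample, "\033[33m")
```

Passing the same markup dict in for every term accumulates all the colours in one map, which is what markupDetails does in the Groovy version.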
As I said, I wanted to colourise console output. The colour parameter in the method here is one of the usual ANSI console colour codes. E.g. yellow is \033[33m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.
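As a quick illustration of how these codes behave, here is a tiny Python sketch. The constant names and the colourise helper are my own for the example (they are not the ConsoleColors class mentioned above), and the output only shows up coloured in a terminal that understands ANSI escapes.

```python
# Standard ANSI SGR codes: 33 is a yellow foreground, 32 green, 0 resets.
YELLOW = "\033[33m"
GREEN = "\033[32m"
RESET = "\033[0m"

def colourise(word, colour):
    """Wrap a word in a colour code and a trailing reset."""
    return colour + word + RESET

line = colourise("chat", YELLOW) + " et " + colourise("chien", GREEN)
print(line)
```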
... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You work backwards from the end of the text so as to insert the "markup" at the right position in the String: that way each insertion leaves the offsets before it untouched. NB here text is your original unmarked-up String:
markupDetails.sort().reverseEach{ pos, markup ->
    String firstPart = text.substring( 0, pos )
    String secondPart = text.substring( pos )
    text = firstPart + markup + secondPart
}
... at the end of which text contains your marked-up String: print to console. Lovely.
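The backwards insertion can be sketched in Python as follows. The map below is hand-written for the example (in practice it would come from the per-term step), and apply_markup is a hypothetical name.

```python
def apply_markup(text, markup):
    """Insert each code at its offset, highest offset first."""
    # Going from the end means earlier offsets are still valid
    # after each insertion lengthens the string.
    for pos in sorted(markup, reverse=True):
        text = text[:pos] + markup[pos] + text[pos:]
    return text

# Hand-written offsets: "quick" in yellow, "fox" in green.
markup = {4: "\033[33m", 9: "\033[0m", 16: "\033[32m", 19: "\033[0m"}
result = apply_markup("the quick brown fox", markup)
print(result)
```

One limitation of a plain offset-to-code map, in either language: if one term ends exactly where another begins, the two entries share a key and one overwrites the other.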