I am trying to highlight search terms in a block of HTML. The problem is that if a user searches for "color", this:
<span style='color: white'>White</span>
becomes: <span style='<b>color</b>: white'><b>White</b></span>
and obviously, messing up my style is not a good idea.
Here is the code I am using:
Query parsedQuery = parser.Parse(luceneQuery);
StandardAnalyzer Analyzer = new StandardAnalyzer();
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b class='search'>", "</b>");
QueryScorer scorer = new QueryScorer(parsedQuery);
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(new SimpleFragmenter());
highlighter.GetBestFragment(Analyzer, propertyName, invocation.ReturnValue.ToString());
I'm guessing the problem is that I need a different Fragmenter, but I'm not sure. Any help would be appreciated.
Upvotes: 9
Views: 1354
I think I figured it out...
I subclassed StandardAnalyzer and changed TokenStream to this:
public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
    // Tokenize with the tag-skipping tokenizer instead of the base StandardTokenizer.
    HtmlStripCharFilter filter = new HtmlStripCharFilter(reader);
    TokenStream result = new StandardFilter(filter);
    // Same filter chain as StandardAnalyzer: lower-case, then drop stop words.
    return new StopFilter(new LowerCaseFilter(result), this.stopSet);
}
and implemented HtmlStripCharFilter as:
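public class HtmlStripCharFilter : Lucene.Net.Analysis.CharTokenizer
{
    // True while the reader is between '<' and the next '>'.
    private bool inTag = false;

    public HtmlStripCharFilter(TextReader input)
        : base(input)
    {
    }

    // Called for each character; returning false ends the current token.
    protected override bool IsTokenChar(char c)
    {
        // An opening bracket starts a tag; it is never part of a token.
        if (c == '<' && !inTag)
        {
            inTag = true;
            return false;
        }
        // A closing bracket ends the tag; it is not part of a token either.
        if (c == '>' && inTag)
        {
            inTag = false;
            return false;
        }
        // Outside a tag, every non-whitespace character belongs to a token.
        return !inTag && !Char.IsWhiteSpace(c);
    }
}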
It's headed in the right direction, but still needs a lot more work before it's done. If anyone has a better solution (read "TESTED" solution) I would love to hear it.
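To see what the tag-skipping rule does in isolation, here is a minimal sketch that does not use Lucene.Net at all. `TagAwareHighlighter` and `Highlight` are hypothetical names made up for this illustration, and unlike the tokenizer above it treats only letters and digits as word characters. It is untested against real-world HTML and is meant only to show the `inTag` state machine at work, not to replace the Highlighter.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of Lucene.Net.
public static class TagAwareHighlighter
{
    // Wraps whole-word, case-insensitive matches of `term` in pre/post,
    // but only when the word appears outside a <...> tag.
    public static string Highlight(string html, string term, string pre, string post)
    {
        var output = new StringBuilder();
        var word = new StringBuilder();
        bool inTag = false;

        // Emits the buffered word, wrapped in pre/post when it equals the term.
        Action flush = () =>
        {
            if (word.Length == 0) return;
            string w = word.ToString();
            bool hit = string.Equals(w, term, StringComparison.OrdinalIgnoreCase);
            output.Append(hit ? pre + w + post : w);
            word.Clear();
        };

        foreach (char c in html)
        {
            if (!inTag && c == '<')
            {
                // A tag starts: finish any pending word, then copy the bracket.
                flush();
                inTag = true;
                output.Append(c);
            }
            else if (inTag)
            {
                // Inside a tag everything is copied verbatim, never highlighted.
                if (c == '>') inTag = false;
                output.Append(c);
            }
            else if (char.IsLetterOrDigit(c))
            {
                word.Append(c);
            }
            else
            {
                // Whitespace or punctuation outside a tag ends the current word.
                flush();
                output.Append(c);
            }
        }
        flush();
        return output.ToString();
    }
}
```

Calling `TagAwareHighlighter.Highlight("<span style='color: white'>White</span>", "color", "<b class='search'>", "</b>")` leaves the markup untouched, because the only occurrence of "color" is inside the tag. The sketch shares the limitations of the tokenizer above: a '>' inside an attribute value, HTML comments, and script or style blocks are not handled.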
Upvotes: 3