Ron.Eng

Reputation: 489

Highlighting whole word in HTML string using C# regexp

I wrote a method that highlights keywords in an HTML string. It returns the updated string and a list of the matched keywords. I want to match a keyword whether it appears as a whole word or surrounded by dashes. However, when it appears with dashes, the dashes are highlighted and returned along with the word.

For example, if the word is locks and the HTML contains He -locks- the door, then the dashes around the word are also highlighted:

He <span style="background-color:yellow">-locks-</span> the door.

Instead of:

He -<span style="background-color:yellow">locks</span>- the door.

In addition, the returned list contains -locks- instead of locks.

What can I do to get my expected result?

Here is my code:

private static List<string> FindKeywords(IEnumerable<string> words, bool bHighlight, ref string text)
{
    HashSet<String> matchingKeywords = new HashSet<string>(new CaseInsensitiveComparer());

    string allWords = "\\b(-)?(" + words.Aggregate((list, word) => list + "|" + word) + ")(-)?\\b";
    Regex regex = new Regex(allWords, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    foreach (Match match in regex.Matches(text))
    {
        matchingKeywords.Add(match.Value);
    }

    if (bHighlight)
    {
        text = regex.Replace(text, string.Format("<span style=\"background-color:yellow\">{0}</span>", "$0"));
    }

    return matchingKeywords.ToList();
}

Upvotes: 2

Views: 866

Answers (1)

Wiktor Stribiżew

Reputation: 627101

You need to use match.Groups[2].Value instead of match.Value: your regex has three capturing groups, and the second one contains the keyword that you want to highlight:

foreach (Match match in regex.Matches(text))
{
    matchingKeywords.Add(match.Groups[2].Value);
}

if (bHighlight)
{
    text = regex.Replace(text, string.Format("$1<span style=\"background-color:yellow\">{0}</span>$3", "$2"));
}

In the foreach loop, match.Groups[2].Value holds the keyword alone. In the regex.Replace replacement string, $2 is the backreference to that same captured keyword, while $1 and $3 put back the optional hyphens around it (each captured with (-)?), so they stay outside the span.
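
For reference, here is a minimal sketch of how the whole method might look with both changes applied. It is not the asker's exact code: the class name KeywordHighlighter is hypothetical, StringComparer.OrdinalIgnoreCase stands in for the comparer used in the question, and Regex.Escape is an extra precaution that assumes the keywords are meant to be matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordHighlighter // hypothetical class name
{
    public static List<string> FindKeywords(IEnumerable<string> words, bool bHighlight, ref string text)
    {
        // Assumption: OrdinalIgnoreCase replaces the question's comparer.
        var matchingKeywords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Group 1: optional leading hyphen, group 2: keyword, group 3: optional trailing hyphen.
        // Assumption: keywords are literal text, so each one is escaped.
        string allWords = @"\b(-)?(" + string.Join("|", words.Select(Regex.Escape)) + @")(-)?\b";
        var regex = new Regex(allWords, RegexOptions.IgnoreCase);

        foreach (Match match in regex.Matches(text))
        {
            // Collect only the keyword, never the surrounding hyphens.
            matchingKeywords.Add(match.Groups[2].Value);
        }

        if (bHighlight)
        {
            // $1 and $3 re-emit the hyphens (if any) outside the span; $2 is the keyword.
            text = regex.Replace(text, "$1<span style=\"background-color:yellow\">$2</span>$3");
        }

        return matchingKeywords.ToList();
    }
}
```

Called with the keyword locks and the text He -locks- the door, this sketch should return a list containing only locks and leave both hyphens outside the span.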

Upvotes: 2
