Cory Dee

Reputation: 2891

Strip HTML, Keep Bold/Strong Tags

I have a small bit of regex that strips out all HTML, and it works great. What I need to do now is strip out all HTML but keep the <b> and <strong> tags intact.

Any help would be greatly appreciated.

shortDesc = System.Text.RegularExpressions.Regex.Replace(shortDesc, @"<[^>]*>", String.Empty);

Thanks!

Upvotes: 1

Views: 1665

Answers (3)

ridgerunner

Reputation: 34385

Here is a simple extension of your regex that should work pretty well:

Regex re = new Regex(@"<(?!/?(?:strong|b)\b)[^>]*>",
    RegexOptions.IgnoreCase);
text = re.Replace(text, "");
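As a rough illustration of how that pattern behaves, here is a minimal sketch that wraps it in a helper. The class and method names (`BoldPreservingStripper`, `Strip`) are made up for this example and are not part of the answer above. The negative lookahead `(?!/?(?:strong|b)\b)` rejects a match when the tag name is exactly `b` or `strong`; the `\b` word boundary is what stops `<br>` or `<body>` from being treated as `<b>`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named only for this sketch.
public static class BoldPreservingStripper
{
    // Same pattern as above: match any tag unless its name is b or strong.
    private static readonly Regex TagRegex = new Regex(
        @"<(?!/?(?:strong|b)\b)[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag except opening/closing <b> and <strong>.
    public static string Strip(string html)
    {
        return TagRegex.Replace(html, "");
    }
}
```

For example, `BoldPreservingStripper.Strip("<p>Hello <b>big</b><br/> <strong class=\"x\">world</strong></p>")` should leave `Hello <b>big</b> <strong class="x">world</strong>`. Note that attributes on the kept tags survive untouched, which may or may not be what you want.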

Upvotes: 2

MrCC

Reputation: 713

From what I gathered in your comments, careful use of regular expressions (though usually shunned for obvious reasons) could work here, provided that you meet the following requirements:

  1. The HTML is not malformed.
  2. It won't contain "<" and ">" as part of *anything other than opening / closing tags*.

If the HTML page is under your control, it is fairly reasonable to assume that you can meet both conditions. Otherwise, I wouldn't bother.

In your case, you can use the overload of the Regex.Replace method that accepts a MatchEvaluator delegate, which is called once per match and returns the replacement text.

Usage example:

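// matchPattern is your tag-matching pattern; source is the input HTML.
MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);
Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
// Each match is replaced by whatever MatchHandler returns for it.
string newString = RE.Replace(source, replaceCallback);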

MatchHandler example:

public static string MatchHandler(Match theMatch) {
  if (theMatch.Value.StartsWith("<b>") || ...) {
    return theMatch.Value;  //return as is
  }
  //else return empty string
  return "";
}
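Putting the two snippets together, a self-contained version might look like the sketch below. Several things here are assumptions rather than part of the answer: the pattern is taken to be the question's original `<[^>]*>`, the kept tags are limited to the exact forms `<b>`, `</b>`, `<strong>` and `</strong>` (so tags with attributes are dropped), and the names `TagFilter`, `KeepBoldOnly` and `Strip` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named only for this sketch.
public static class TagFilter
{
    // Assumption: the "match any tag" pattern from the question.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    // Assumption: only these exact tag forms are kept (no attributes).
    private static readonly string[] KeptTags =
        { "<b>", "</b>", "<strong>", "</strong>" };

    // Called once per matched tag; the return value replaces the match.
    private static string KeepBoldOnly(Match match)
    {
        foreach (string tag in KeptTags)
        {
            if (string.Equals(match.Value, tag, StringComparison.OrdinalIgnoreCase))
            {
                return match.Value; // keep the tag unchanged
            }
        }
        return string.Empty; // drop every other tag
    }

    public static string Strip(string source)
    {
        return AnyTag.Replace(source, new MatchEvaluator(KeepBoldOnly));
    }
}
```

With this, `TagFilter.Strip("<div><B>a</B> <em>b</em> <strong>c</strong></div>")` should give `<B>a</B> b <strong>c</strong>`. Compared with a single lookahead regex, the callback makes the keep/drop decision ordinary code, which is easier to extend if the whitelist grows.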

Upvotes: 0

Quentin

Reputation: 943143

  1. Stop trying to parse HTML with a regular expression
  2. Use something like HTML Agility Pack

Upvotes: 4
