Using C# regular expressions to remove HTML tags

How do I use C# regular expression to replace/remove all HTML tags, including the angle brackets? Can someone please help me with the code?

Upvotes: 150

Answers (11)

srjheam

Reputation: 35

It seems that @JasonTrue's answer no longer works because of the "//body//text()" XPath.

Accessing all of the document's child nodes and then filtering out the empty text nodes may be the way to go.

public static string StripInnerText(string html)
{
    if (string.IsNullOrEmpty(html))
        return null;

    HtmlAgilityPack.HtmlDocument doc = new();
    doc.LoadHtml(html);

    // Collect the text of each top-level node, skipping whitespace-only entries.
    var texts = doc.DocumentNode.ChildNodes
        .Select(node => node.InnerText)
        .Where(text => !string.IsNullOrWhiteSpace(text))
        .Select(text => text.Trim())
        .ToList();

    var output = string.Join(Environment.NewLine, texts);

    // Decode entities such as &amp; into plain characters.
    return HttpUtility.HtmlDecode(output);
}

Test it with the following fiddle: https://dotnetfiddle.net/NQC2Y5

Sorry for posting a new answer. I don't have 50 reputation at the moment, and this question and all the answers here were so useful to me that I felt I had a duty to contribute.

Upvotes: 0

JasonTrue

Reputation: 19654

The correct answer is don't do that, use the HTML Agility Pack.

Edited to add:

To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even most imperfectly formed, capricious bits of HTML:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
   output.AppendLine(line);
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());

There are very few defensible cases for using a regular expression for parsing HTML, as HTML can't be parsed correctly without a context-awareness that's very painful to provide even in a nontraditional regex engine. You can get part way there with a RegEx, but you'll need to do manual verifications.

Html Agility Pack can provide you a robust solution that will reduce the need to manually fix up the aberrations that can result from naively treating HTML as a regular language.

A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.

Upvotes: 85

AnisNoorAli

Reputation: 81

Use this method to remove tags:

public string From_To(string text, string from, string to)
{
    if (text == null)
        return null;
    // Note: from and to are inserted as regex fragments, not escaped literals.
    string pattern = from + ".*?" + to;
    // Remove every lazy match running from "from" to the nearest "to", ignoring case.
    return Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);
}

Upvotes: -2

GRUNGER

Reputation: 496

Add .+? to <[^>]*> and try this regex (based on this):

<[^>].+?>

c# .net regex demo

Upvotes: 2

Daniel Brückner

Reputation: 59705

As often stated before, you should not use regular expressions to process XML or HTML documents. They do not perform very well with HTML and XML documents, because there is no way to express nested structures in a general way.

You could use the following.

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

This will work for most cases, but there will be cases (for example CDATA containing angle brackets) where this will not work as expected.
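
To make the "most cases" caveat concrete, here is a minimal sketch that wraps the same pattern in a helper. The class and method names (SimpleTagStripper, Strip) are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper name; only the pattern comes from the answer above.
public static class SimpleTagStripper
{
    // '<', then any run of characters that are not '>', then '>'.
    private static readonly Regex TagPattern = new Regex(@"<[^>]*>");

    // Removes everything the pattern matches; null input is treated as empty.
    public static string Strip(string html)
    {
        return TagPattern.Replace(html ?? string.Empty, string.Empty);
    }
}
```

With this sketch, Strip("<p>Hello <b>world</b></p>") gives "Hello world". A '>' inside an attribute value or a CDATA section ends the match early, so Strip("<a title=\"a > b\">link</a>") leaves the debris " b\">link" behind.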

Upvotes: 179

Owidat

Reputation: 1081

Try the regular expression method described at this URL: http://www.dotnetperls.com/remove-html-tags

/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}

/// <summary>
/// Compiled regular expression for performance.
/// </summary>
static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

/// <summary>
/// Remove HTML from string with compiled Regex.
/// </summary>
public static string StripTagsRegexCompiled(string source)
{
    return _htmlRegex.Replace(source, string.Empty);
}
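
One thing to watch for with "<.*?>": by default '.' does not match a newline in .NET, so a tag that spans several lines is left in place. A hedged sketch of a variant that handles this with RegexOptions.Singleline follows; the class name MultilineTagStripper is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical variant of the methods above, not taken from the linked article.
public static class MultilineTagStripper
{
    // Singleline makes '.' match '\n' too, so tags broken across lines still match.
    private static readonly Regex Tag =
        new Regex("<.*?>", RegexOptions.Singleline | RegexOptions.Compiled);

    // Removes every match; null input is treated as empty.
    public static string Strip(string source)
    {
        return Tag.Replace(source ?? string.Empty, string.Empty);
    }
}
```

For the input "<a\nhref='x'>link</a>", the original pattern without Singleline removes only the closing tag, while this variant returns "link".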

Upvotes: 6

zzzzBov

Reputation: 179256

@JasonTrue is correct that HTML tags should not be stripped with regular expressions.

It's quite simple to strip HTML tags using HtmlAgilityPack:

public string StripTags(string input) {
    var doc = new HtmlDocument();
    doc.LoadHtml(input ?? "");
    return doc.DocumentNode.InnerText;
}

Upvotes: 20

Swaroop

Reputation: 49

Use this:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Upvotes: 4

CountZero

Reputation: 6399

I would like to echo Jason's response, though sometimes you need to naively parse some HTML and pull out the text content.

I needed to do this with some Html which had been created by a rich text editor, always fun and games.

In this case you may need to remove the content of some tags as well as just the tags themselves.

In my case <xml> and <style> tags were thrown into the mix. Someone may find my (very slightly) less naive implementation a useful starting point.

    /// <summary>
    /// Removes all HTML tags from a string and leaves only plain text.
    /// Also removes the content of <xml></xml> and <style></style> tags, as the aim is to get text content, not markup or metadata.
    /// </summary>
    /// <param name="input">The HTML to strip.</param>
    /// <returns>The text content with all tags removed.</returns>
    public static string HtmlStrip(this string input)
    {
        input = Regex.Replace(input, "<style>(.|\n)*?</style>", string.Empty); // remove <style></style> tags and anything in between
        input = Regex.Replace(input, @"<xml>(.|\n)*?</xml>", string.Empty); // remove <xml></xml> tags and anything in between
        return Regex.Replace(input, @"<(.|\n)*?>", string.Empty); // remove any remaining tags but not their content: "<p>bob<span> johnson</span></p>" becomes "bob johnson"
    }

Upvotes: 14

Alan Moore

Reputation: 75262

The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to:

remove the <!DOCTYPE declaration or <?xml prolog if they exist
remove all SGML comments
remove the entire HEAD element
remove all SCRIPT and STYLE elements
do Grabthar-knows-what with FORM and TABLE elements
remove the remaining tags
remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone

That's just off the top of my head--I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.
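
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a complete solution: the names HtmlPageStripper and ToPlainText are made up, FORM and TABLE elements get no special treatment, the tag pattern is a simplified one, and each tag is replaced by a space so that adjacent blocks do not run together (which also splits words that contain inline tags).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; a rough sketch of the listed steps, not a real HTML parser.
public static class HtmlPageStripper
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        string s = html;

        // 1. Remove the DOCTYPE declaration and the XML prolog, if present.
        s = Regex.Replace(s, @"<!DOCTYPE[^>]*>|<\?xml.*?\?>", string.Empty, Opts);
        // 2. Remove comments.
        s = Regex.Replace(s, @"<!--.*?-->", string.Empty, Opts);
        // 3. Remove HEAD, SCRIPT and STYLE elements together with their content.
        s = Regex.Replace(s, @"<(head|script|style)\b[^>]*>.*?</\1\s*>", string.Empty, Opts);
        // 4. Replace the remaining tags with a space (simplified pattern, an assumption).
        //    Tag-like text inside CDATA sections is removed here as well.
        s = Regex.Replace(s, @"</?\w+[^>]*>", " ", Opts);
        // 5. Drop the CDATA markers but keep what is between them.
        s = Regex.Replace(s, @"<!\[CDATA\[(.*?)\]\]>", "$1", Opts);
        // 6. Decode entities such as &amp; and collapse runs of whitespace.
        s = WebUtility.HtmlDecode(s);
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

The final whitespace pass is there because, as noted above, removing markup leaves big chunks of useless whitespace; deciding where paragraph breaks belong is left out of this sketch.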

But, assuming you're working with just a fragment and you can get away with simply removing all tags, here's the regex I would use:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.

In case you're wondering about those (?>...) constructs, they're atomic groups. They make the regex a little more efficient, but more importantly, they prevent runaway backtracking, which is something you should always watch out for when you mix alternation and nested quantifiers as I've done. I don't really think that would be a problem here, but I know if I don't mention it, someone else will. ;-)

This regex isn't perfect, of course, but it's probably as good as you'll ever need.
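
As a hedged usage sketch, the regex can be wrapped like this. The class name FragmentTagStripper is hypothetical; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex from this answer.
public static class FragmentTagStripper
{
    // Atomic groups: the tag opener, then any mix of unquoted text and quoted strings, then '>'.
    private static readonly Regex Tag = new Regex(
        @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");

    // Removes every matched tag; null input is treated as empty.
    public static string Strip(string fragment)
    {
        return Tag.Replace(fragment ?? string.Empty, string.Empty);
    }
}
```

Because quoted strings are matched as units, Strip("<a title=\"a > b\" href='x>y'>link</a> text") returns "link text". The pattern requires a tag name right after '<' or '</', so comments and the DOCTYPE declaration are left untouched and need separate handling.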

Upvotes: 39

Ryan Emerle

Reputation: 15821

Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);

Source

Upvotes: 28
