Daniel Peñalba

Reputation: 31897

Get plain text from HTML in .NET

What is the best way to get a plain text string from an HTML string?

public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}

Thanks in advance

Upvotes: 31

Views: 38965

Answers (5)

Alex

Reputation: 5449

There is no built-in solution in the framework.

If you need to parse HTML, I have had good experiences with a library called HTML Agility Pack.
It parses an HTML file and exposes it as a DOM, similar to the XML classes.
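
For illustration, here is a minimal sketch of how this might look with HTML Agility Pack, assuming the HtmlAgilityPack NuGet package is referenced. The class name `HtmlAgilityPackExample` is hypothetical, and the method name simply mirrors the signature from the question.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class; only the HtmlAgilityPack types are real library types.
public static class HtmlAgilityPackExample
{
    public static string GetPlainText(string htmlString)
    {
        // Parse the markup into a DOM; the parser tolerates malformed HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // InnerText concatenates the text of all descendant nodes.
        // DeEntitize then converts entities such as &amp; back to characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Note that InnerText also includes the contents of script and style elements, so those nodes may need to be removed first if the input contains them.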

Upvotes: 5

user1641172

Reputation:

Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

Return HttpUtility.HtmlDecode(
                Regex.Replace(HtmlString, "<(.|\n)*?>", "")
                )

This removes all the tags, then decodes any remaining HTML entities such as &lt; or &gt;.
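
As a rough C# sketch of the same idea, fitted to the method signature from the question: this version assumes `System.Net.WebUtility.HtmlDecode` (available since .NET 4.0) in place of `HttpUtility`, so that no reference to System.Web is needed. The class name `HtmlText` is hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class wrapping the regex-plus-decode approach.
public static class HtmlText
{
    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return string.Empty;
        }

        // Singleline makes '.' match newlines too, which replaces the (.|\n) alternation.
        string withoutTags = Regex.Replace(htmlString, "<.*?>", string.Empty, RegexOptions.Singleline);

        // Decode entities such as &amp;, &lt; and &gt; into plain characters.
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        Console.WriteLine(GetPlainText("<p>Tom &amp; Jerry</p>")); // prints "Tom & Jerry"
    }
}
```

Like any regex-based stripper, this keeps the text inside script and style elements and can be confused by a ">" inside an attribute value, so it is best suited to simple, trusted markup.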

Upvotes: 1

Alex K.

Reputation: 175936

You can use MSHTML, which can be pretty forgiving:

// using mshtml; (requires a reference to the Microsoft.mshtml assembly)
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });

string txt = htmldoc2.body.outerText;

Plateau of Leng 2 sugars please what? & who?

Upvotes: 46

Rudi Visser

Reputation: 22019

There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

string htmlString = @"<p>I'm HTML!</p>";
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", ""); // "I'm HTML!"

Upvotes: 25

Erick Petrucelli

Reputation: 14912

There is no built-in .NET method for this. But, as @rudi_visser pointed out, it can be done with regular expressions.

If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.

Upvotes: 0
