Reputation: 31897
What is the best way to get a plain text string from an HTML string?
public string GetPlainText(string htmlString)
{
// any .NET built in utility?
}
Thanks in advance
Upvotes: 31
Views: 38965
Reputation: 5449
There is no built-in solution in the framework.
If you need to parse HTML, I have had good experience with a library called HTML Agility Pack.
It parses an HTML file and exposes it through a DOM, similar to the XML classes.
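As a rough sketch of how that might look for the question's method: the code below assumes the HtmlAgilityPack NuGet package is referenced, and the class name HtmlAgilityPackExample is made up for illustration. It loads the string into a document, takes the text of the root node, and decodes entities.

```csharp
using HtmlAgilityPack;

// Hypothetical wrapper class, not part of the library.
public static class HtmlAgilityPackExample
{
    public static string GetPlainText(string htmlString)
    {
        // Parse the markup into a DOM; the parser tolerates malformed HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // InnerText concatenates the text nodes but leaves entities encoded,
        // so decode them afterwards.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling HtmlAgilityPackExample.GetPlainText("<p>Hello <b>world</b> &amp; all</p>") should give "Hello world & all".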
Upvotes: 5
Reputation:
Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.
return HttpUtility.HtmlDecode(
    Regex.Replace(htmlString, "<(.|\n)*?>", "")
);
This removes all the tags and then decodes any remaining entities such as &lt; or &gt;.
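For completeness, here is a self-contained sketch of the same idea as a full method. Note one swap: it uses System.Net.WebUtility.HtmlDecode instead of HttpUtility.HtmlDecode so that no reference to System.Web is needed; that choice and the class name HtmlText are my assumptions, not part of the answer above.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class HtmlText
{
    public static string GetPlainText(string htmlString)
    {
        // Remove everything that looks like a tag, including tags that span lines.
        string withoutTags = Regex.Replace(htmlString, "<(.|\n)*?>", "");

        // Turn entities such as &lt; and &amp; back into characters.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, HtmlText.GetPlainText("<p>1 &lt; 2 &amp; 3 &gt; 2</p>") returns "1 < 2 & 3 > 2". Keep in mind this is a text-level filter, not a parser, so the contents of script or style elements are kept as text.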
Upvotes: 1
Reputation: 175936
You can use MSHTML, which can be pretty forgiving:
// requires a reference to Microsoft.mshtml
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? & who?" });
string txt = htmldoc2.body.outerText;
Result:
Plateau of Leng 2 sugars please what? & who?
Upvotes: 46
Reputation: 22019
There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:
string htmlString = @"<p>I'm HTML!</p>";
// Regex.Replace returns a new string; the input is not modified.
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");
Upvotes: 25
Reputation: 14912
There is no built-in .NET method for this, but as @rudi_visser pointed out, it can be done with regular expressions.
If you need to do more than remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.
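If entity decoding is the only extra step you need, the framework's own decoder may be enough. A minimal sketch, assuming .NET 4.0 or later (where System.Net.WebUtility is available); the class and method names are made up for illustration:

```csharp
using System.Net;

// Hypothetical helper class for illustration.
public static class EntityDecodingExample
{
    public static string Decode(string text)
    {
        // Handles named entities (&acirc;) as well as numeric ones (&#226;).
        return WebUtility.HtmlDecode(text);
    }
}
```

EntityDecodingExample.Decode("&acirc; and &#226;") returns "â and â".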
Upvotes: 0