Reputation: 1747
I tried HtmlAgilityPack with the following code, but it does not capture the text inside HTML lists:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode node = doc.DocumentNode;
return node.InnerText;
Here is the HTML it fails on:
<as html>
<p>This line is picked up <b>correctly</b>. List items hasn't...</p>
<p><ul>
<li>List Item 1</li>
<li>List Item 2</li>
<li>List Item 3</li>
<li>List Item 4</li>
</ul></p>
</as html>
Upvotes: 6
Views: 2036
Reputation: 1747
The following piece of code works for me:
string StripHTML(string htmlStr)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    var root = doc.DocumentNode;
    string s = "";
    // Visit every node in the tree, including the root itself.
    foreach (var node in root.DescendantNodesAndSelf())
    {
        // Only leaf nodes contribute, so no text is counted twice.
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
                s += text.Trim() + " ";
        }
    }
    return s.Trim();
}
Upvotes: 3
Reputation: 16656
Because you need to walk over the tree and concatenate the InnerText
of all nodes in some way.
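
A minimal sketch of that tree walk, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (HtmlText, ExtractText) are made up for illustration. It keeps only text nodes, decodes entities, and joins the pieces with single spaces; this is one possible way to concatenate, not the only one:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative and not part of HtmlAgilityPack.
public static class HtmlText
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = new List<string>();

        // Walk every node in the parsed tree, root included.
        foreach (HtmlNode node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Element and comment nodes are skipped; only text nodes carry content.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            // Decode entities such as &amp; and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
                parts.Add(text);
        }

        // Assumption: a single space is an acceptable separator between fragments.
        return string.Join(" ", parts);
    }
}
```

Calling HtmlText.ExtractText(htmlStr) on the markup from the question should then include the list item text as well. Note that joining with a space can leave a gap before punctuation that directly follows an inline tag, so adjust the separator logic if exact spacing matters.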
Upvotes: 3