How can I extract the text visible on a page from its HTML source?

Question

I tried HtmlAgilityPack with the following code, but it does not capture text from HTML lists:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode node = doc.DocumentNode;
return node.InnerText;

Here is the HTML that fails:


This line is picked up correctly. List items aren't...

List Item 1
List Item 2
List Item 3 
List Item 4

Luke G · Accepted Answer

The following code works for me:

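string StripHTML(string htmlStr)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    var root = doc.DocumentNode;
    // StringBuilder avoids reallocating the string on every append.
    var sb = new System.Text.StringBuilder();
    foreach (var node in root.DescendantNodesAndSelf())
    {
        // Visit leaf nodes only, so each piece of text is collected once.
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            // Skip whitespace-only nodes so they do not add stray spaces.
            if (!string.IsNullOrWhiteSpace(text))
            {
                sb.Append(text.Trim()).Append(' ');
            }
        }
    }
    return sb.ToString().Trim();
}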
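
One caveat with the leaf-node approach: it also collects the contents of script and style elements and of comments, and it leaves entities such as &amp;amp; undecoded, so the result is not strictly the visible text. A rough sketch of a stricter variant follows. The class name VisibleText and the method Extract are made up for illustration and are not part of HtmlAgilityPack; the rule that only script and style are invisible is an assumption, and elements hidden via CSS are not handled.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

static class VisibleText
{
    // Hypothetical helper (not an HtmlAgilityPack API): gathers text nodes only.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Only text nodes carry text; this check also skips comment nodes.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Assumption: text directly inside <script> or <style> is never visible.
            string parent = node.ParentNode?.Name;
            if (parent == "script" || parent == "style") continue;

            // Decode entities (e.g. &amp;) and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) sb.Append(text).Append(' ');
        }
        return sb.ToString().TrimEnd();
    }
}
```

Calling VisibleText.Extract(htmlStr) should then return the paragraph and list-item text separated by single spaces, without script or style contents.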
