Parsing HTML using HTMLAgilityPack

Question

I have the following HTML that I'm trying to parse using the HTML Agility Pack.

This is a snippet of the whole file that is returned by the code:


    text here text here text 
    text here text here text text here text here text text here text here text text here text here text 
    
        
            
                
                    
                       
                    
                    caption text
                
            
        
    
    text here text here text text here text here text text here text here text text here text here text 
    text here text here text text here text here text text here text here text text here text here text text here text here text 
    text here text here text text here text here text text here text here text text here text here text text here text here text

I get this snippet using the following code (which is messy, I know):

string url = "http://www.domain.com/story.html";
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
        .SelectMany(div => div.Descendants("p"))
        .ToList();
int cn = links.Count;

HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    textBox1.AppendText(node.InnerText.Trim());
    textBox1.AppendText(System.Environment.NewLine);
}

The code loops through each p and (for now) appends it to a textbox. Everything works correctly except for the div tag with the class gallery clr bdr aln-c js-no-shadow mod cld: the contents of that bit of HTML, including the caption text, end up in the results.

What's the best way to omit that from the results?

Simon Mourier · Accepted Answer

XPath is your friend. Try this and forget about that crappy XLinq syntax :-)

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
if (tl != null)
{
    foreach (HtmlNode node in tl)
    {
        Console.WriteLine(node.InnerText.Trim());
    }
}

This expression selects all p elements that have no attributes set. See here for other samples: XPath Syntax
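To make the effect of the attribute filter concrete, here is a minimal sketch that runs the expression against an inline document. The sample markup, the `StoryParser` class, the `ExtractParagraphs` helper, and the `caption` class are all hypothetical, since the real page structure is not shown in the question. It also tries an alternative expression that excludes paragraphs by ancestor rather than by attributes, which may be a safer choice if the story paragraphs themselves carry attributes; that variant is an assumption about what the page looks like, not something the answer confirms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class StoryParser
{
    // Hypothetical markup standing in for the real page.
    public const string SampleHtml = @"
<div class='story-body'>
  <p>First paragraph.</p>
  <div class='gallery clr bdr aln-c js-no-shadow mod cld'>
    <p class='caption'>caption text</p>
  </div>
  <p>Second paragraph.</p>
</div>";

    // Returns the trimmed text of every node matched by the XPath expression.
    public static List<string> ExtractParagraphs(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when there are no matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        // The expression from the answer: p elements with no attributes.
        foreach (string text in ExtractParagraphs(SampleHtml, "//p[not(@*)]"))
        {
            Console.WriteLine(text);
        }

        // Alternative (assumption): skip any p nested inside the gallery div.
        string byAncestor = "//p[not(ancestor::div[contains(@class, 'gallery')])]";
        foreach (string text in ExtractParagraphs(SampleHtml, byAncestor))
        {
            Console.WriteLine(text);
        }
    }
}
```

With this sample, both expressions are expected to skip the caption paragraph. The attribute-based filter only works as long as the unwanted p elements have at least one attribute and the wanted ones have none, so it is worth checking against the actual page.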
