Andrew

Reputation: 10033

Removing all elements from HTML that have given class using Agility Pack

I'm trying to select all elements that have a given class and remove them from an HTML string.

This is what I have so far. It doesn't seem to remove anything, although the source clearly shows 4 elements with that class name.

// Filter page HTML to display required content
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// pageHTML is a string containing the HTML
htmlDoc.LoadHtml(pageHTML);

// ParseErrors contains any errors found by the LoadHtml call
if (!htmlDoc.ParseErrors.Any())
{
    // Remove all elements marked with pdf-ignore class
    HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//body[@class='pdf-ignore']");

    // Remove the collection from above
    foreach (var node in nodes)
    {
        node.Remove();
    }
}

EDIT: Just to clarify, the document is parsing and the SelectNodes line is being hit; it is just not returning anything.

Here is a snippet of the html:

<input type=\"submit\" name=\"ctl00$MainContent$PrintBtn\" value=\"Print Shotlist\" onclick=\"window.print();\" id=\"MainContent_PrintBtn\" class=\"pdf-ignore\">

Upvotes: 0

Views: 1079

Answers (2)

Oleks

Reputation: 32323

EDIT: in your updated question you posted part of the HTML string, an <input> element declaration, but you're trying to match a <body> element with the class pdf-ignore (according to your expression //body[@class='pdf-ignore']).

If you want to match all the elements from the document with this class you should use:

var nodes = htmlDoc.DocumentNode.SelectNodes("//*[contains(@class,'pdf-ignore')]");

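to get your nodes. This matches every element whose class attribute contains pdf-ignore, including elements that carry several classes. Note that contains() is a substring test, so it would also match a class name such as pdf-ignored.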

Your code seems to be correct except for one detail: the condition htmlDoc.ParseErrors == null. You select and remove nodes ONLY if the ParseErrors property (which is of type IEnumerable<HtmlParseError>) is null, but if no errors are found this property actually returns an empty list. So changing your code to:

if (!htmlDoc.ParseErrors.Any())
{
    // some logic here
}

should solve the issue.
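
Putting the two points together, here is a minimal sketch of the whole select-and-remove step. The PdfFilter class and RemoveByClass method are hypothetical names used only for illustration, and the sketch assumes className contains no quote characters. It swaps the plain contains() test for a space-padded variant so that only whole class tokens match, and it guards against SelectNodes returning null when nothing matches. It deliberately leaves out the ParseErrors check so the selection logic stands on its own.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class PdfFilter
{
    public static string RemoveByClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class attribute with spaces so that only a whole class
        // token matches ("pdf-ignore" but not "pdf-ignored").
        // Assumption: className contains no quote characters.
        string xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                       + className + " ')]";

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            // Copy to a list first so removal does not disturb the iteration.
            foreach (HtmlNode node in nodes.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

With this in place, the filtered markup could be obtained with something like string cleaned = PdfFilter.RemoveByClass(pageHTML, "pdf-ignore");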

Upvotes: 2

Nathan

Reputation: 6216

Your XPath is probably not matching. Have you tried "//div[class='pdf-ignore']" (no "@")?

Upvotes: 0
