HtmlAgilityPack - Remove All Attributes

Question

I am using HtmlAgilityPack 1.11.18 under .NET Core 2.2.

I want to remove all HTML attributes from <p> nodes in an HTML fragment (not a complete document).

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(input);

var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");

foreach (var node in pNodes)
{
    node.Attributes.Remove();
}

return htmlDoc.Text;

This is not doing the trick, am I missing something? The method returns a string which should be the fragment minus the attributes on all <p> elements.

I realize you are not supposed to use RegEx to parse HTML but these are small fragments and I would prefer a RegEx method so I can remove the dependency on HtmlAgilityPack, which I only brought in to handle this cleanly.

Eatos · Accepted Answer

I would prefer a RegEx method so I can remove the dependency on HtmlAgilityPack, which I only brought in to handle this cleanly.

So why not use a regex for such a task? It sounds like you just want to change <p[^>]*> to <p>.*

This is not doing the trick, am I missing something?

Yes. The HtmlDocument class is more like a basic class that holds everything the HTML Agility Pack needs to know about the document before parsing it, and changes made to the DOM structure it holds are not reflected in htmlDoc.Text. I tend to use return htmlDoc.DocumentNode.WriteTo(); as the most proper way, instead of returning htmlDoc.Text.

Try this example below:

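// Requires: using HtmlAgilityPack;
private static string foo()
{
    var htmlDoc = new HtmlDocument();
    // Illustrative fragment; substitute your own input.
    htmlDoc.LoadHtml("<p class=\"example\">text</p>");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");
    if (pNodes == null)
    {
        return htmlDoc.DocumentNode.WriteTo();
    }

    foreach (var node in pNodes)
    {
        // Drop every attribute on this <p> element.
        node.Attributes.Remove();
    }

    // Serialize the modified DOM; htmlDoc.Text still holds the input as loaded.
    return htmlDoc.DocumentNode.WriteTo();
}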
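
To see the difference between the two return values side by side, here is a minimal sketch. The class name TextVersusWriteTo, the Compare helper, and the sample fragment are made up for illustration; only HtmlDocument, LoadHtml, SelectNodes, Attributes.Remove, Text, and WriteTo come from the code above.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, as used in the question

public static class TextVersusWriteTo
{
    // Hypothetical helper: strips attributes from every <p> and returns both
    // the document's stored source text and the re-serialized DOM.
    public static (string SourceText, string Serialized) Compare(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");
        if (pNodes != null)
        {
            foreach (var node in pNodes)
            {
                node.Attributes.Remove();
            }
        }

        return (htmlDoc.Text, htmlDoc.DocumentNode.WriteTo());
    }

    public static void Main()
    {
        // Made-up sample fragment.
        var result = Compare("<p class=\"intro\" id=\"first\">text</p>");

        // The markup as it was loaded.
        Console.WriteLine(result.SourceText);

        // The markup written out from the modified node tree.
        Console.WriteLine(result.Serialized);
    }
}
```

If the explanation above holds, the first line printed should still carry the class and id attributes, while the second should not.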

*As @Progman mentioned, it is a bad idea. Here is an example of why:

Input: >text (you can put anything in the comment, and a regex wouldn't handle that)
Output from HTML Agility Pack: >text
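
For comparison, the regex route the question asks about could be sketched roughly as below. This is only an illustration under assumptions: the class name RegexAttributeStripper, the exact pattern, and both sample inputs are made up here, and the second input is a different failure case from the comment example above (an attribute value containing >).

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexAttributeStripper
{
    // Matches an opening <p ...> tag up to the first '>'.
    // \b keeps the pattern from also matching <pre>, <param>, and similar tags.
    private static readonly Regex OpeningP =
        new Regex(@"<p\b[^>]*>", RegexOptions.IgnoreCase);

    // Replaces every matched opening tag with a bare <p>.
    public static string Strip(string html) => OpeningP.Replace(html, "<p>");

    public static void Main()
    {
        // Works for simple, well-behaved fragments.
        Console.WriteLine(Strip("<p class=\"a\" id=\"b\">text</p>"));
        // prints: <p>text</p>

        // Breaks when an attribute value contains '>': the match ends too early.
        Console.WriteLine(Strip("<p title=\"a > b\">text</p>"));
        // prints: <p> b">text</p>
    }
}
```

The second call shows why the parser-based version is the safer choice whenever the input is not fully under your control.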
