Reputation: 5836
I am using HtmlAgilityPack 1.11.18 under .NET Core 2.2.
I want to remove all HTML attributes from <p> nodes in an HTML fragment (not a complete document).
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(input);
var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");
foreach (var node in pNodes)
{
    node.Attributes.Remove();
}
return htmlDoc.Text;
This is not doing the trick, am I missing something? The method returns a string, which should be the fragment minus the attributes on all <p> elements.
I realize you are not supposed to use RegEx to parse HTML, but these are small fragments and I would prefer a RegEx method so I can remove the dependency on HtmlAgilityPack, which I only brought in to handle this cleanly.
Upvotes: 0
Views: 741
Reputation: 464
I would prefer a RegEx method so I can remove the dependency on HtmlAgilityPack, which I only brought in to handle this cleanly.
So why not use it for such a task? It sounds like you just want to change <p[^>]*> to <p>.*
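A minimal sketch of that replacement, using only System.Text.RegularExpressions. The class and method names (ParagraphCleaner, StripPAttributes) are made up for illustration. Note one deliberate deviation from the pattern above: <p[^>]*> would also match tags such as <pre> or <param>, so the sketch requires whitespace right after the tag name. It is case-sensitive and only meant for small, well-formed fragments.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ParagraphCleaner
{
    // Matches "<p>" or "<p" + whitespace + anything up to the next ">".
    // Requiring whitespace after "p" leaves <pre>, <param>, etc. untouched.
    private static readonly Regex OpeningP = new Regex(@"<p(\s[^>]*)?>");

    public static string StripPAttributes(string html)
    {
        return OpeningP.Replace(html, "<p>");
    }

    public static void Main()
    {
        var input = "<div><p class=\"ok\">text</p><pre id=\"x\">code</pre></div>";
        Console.WriteLine(StripPAttributes(input));
        // prints: <div><p>text</p><pre id="x">code</pre></div>
    }
}
```

This still assumes that no attribute value contains a ">" character; the regex has no notion of quoting.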
This is not doing the trick, am I missing something?
Yes. The HtmlDocument class is more of a basic class that holds everything the HTML Agility Pack needs to know about the document before parsing it, and changes made to the DOM structure it holds won't be reflected in htmlDoc.Text. I've always tended to use return htmlDoc.DocumentNode.WriteTo(); as the most proper way, instead of returning htmlDoc.Text.
Try this example below:
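private static string foo()
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml("<div><p class=\"ok\">text</p></div>");
    // Note: SelectNodes returns null when nothing matches, so guard against that for arbitrary input.
    var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");
    foreach (var node in pNodes)
    {
        node.Attributes.Remove();
    }
    // Serialize the modified DOM instead of returning the originally loaded text.
    return htmlDoc.DocumentNode.WriteTo();
}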
*As @Progman mentioned, it is a bad idea. Here is an example of why:
<div><p class=\"ok\" <!-- comment-->>text</p></div>
(You can put anything in the comment, and a regex wouldn't handle that.)
<div><p></p><!-- comment-->>text</div>
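To make the failure concrete, here is a small sketch that runs the pattern <p[^>]*> over the fragment above. The class and method names are hypothetical. Because [^>]* stops at the first ">", which here is the one closing the comment, the second ">" is left behind in the output.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, only meant to show where the regex breaks.
public static class RegexPitfallDemo
{
    public static string NaiveStrip(string html)
    {
        // The pattern from the answer: "<p", then everything up to the first ">".
        return Regex.Replace(html, "<p[^>]*>", "<p>");
    }

    public static void Main()
    {
        var input = "<div><p class=\"ok\" <!-- comment-->>text</p></div>";
        Console.WriteLine(NaiveStrip(input));
        // prints: <div><p>>text</p></div>
    }
}
```

The same thing happens with any ">" inside the tag, for example in a quoted attribute value, which is why the regex route is only reasonable when you control the shape of the fragments.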
Upvotes: 1