HtmlAgilityPack td.innertext bug?

Question

I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.

This is a C# app in VS2010. Running in Debug, I see the string in my class begins:

Animal and vegetable oils  1 < 5 MW 
5-50 MW  30

But when I assign with:

td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();

The td.InnerHtml shows

Animal and vegetable oils  1 < 5=\"\" mw=\"\">
5-50 MW  30

Why is it putting the equals and escaped quotes into that text??? It doesn't do it across all the data, just a few files. Any ideas? (PS. There are html breaks in the strings not showing up, how do I post so it ignores html? Tried the "indent with 4 spaces but didn't seem to work?)

Jamie Treworgy · Accepted Answer

HTML Agility Pack's HTML parser is treating the < as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag, and so it is treating them as tag attributes. This treatment stops once it runs into the which forces it to close the tag.

The reason it works in browsers is because browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a carat followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case. So I wouldn't say this is a bug, so much as a limitation of HAP's native HTML parser.

An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.

HtmlAgilityPack td.innertext bug?

Answers (1)

Related Questions