Steve

Reputation: 40

HtmlAgilityPack td.innertext bug?

I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.

This is a C# app in VS2010. Running in Debug, I see the string in my class begins:

Animal and vegetable oils  1 < 5 MW <br>5-50 MW  30 <br>

But when I assign with:

td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();

The td.InnerHtml shows

Animal and vegetable oils  1 < 5=\"\" mw=\"\"><br>5-50 MW  30 <br>

Why is it putting the equals signs and escaped quotes into that text? It doesn't do it across all the data, just a few files. Any ideas? (PS: There are HTML breaks in the strings that aren't showing up. How do I post so it ignores HTML? I tried "indent with 4 spaces" but it didn't seem to work.)

Upvotes: 0

Views: 486

Answers (1)

Jamie Treworgy

Reputation: 24344

HTML Agility Pack's HTML parser treats the < as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag and treats them as tag attributes with no value, which is why they are written back out as 5="" and mw="" (the backslashes are just the debugger escaping the quotes). This continues until it runs into the <br>, which forces it to close the tag.

It works in browsers because browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a less-than sign followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case, so I wouldn't call this a bug so much as a limitation of HAP's native HTML parser.
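One possible workaround, offered as a sketch and not as anything HAP itself provides, is to apply that rule yourself before assigning InnerHtml: escape any < that is not followed by a letter, /, ! or ?, since those are the only characters that can start a tag, closing tag, comment/doctype, or processing instruction. HtmlTextFixer and EscapeStrayLessThan are hypothetical names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of HtmlAgilityPack).
public static class HtmlTextFixer
{
    // Matches a '<' that cannot begin markup: the next character is not
    // a letter, '/', '!' or '?' (this also matches a '<' at end of input).
    private static readonly Regex StrayLessThan =
        new Regex(@"<(?![A-Za-z/!?])", RegexOptions.Compiled);

    // Replaces each stray '<' with the entity "&lt;" so the parser
    // sees it as text rather than the start of a tag.
    public static string EscapeStrayLessThan(string html)
    {
        if (html == null) throw new ArgumentNullException("html");
        return StrayLessThan.Replace(html, "&lt;");
    }
}
```

With that in place, the assignment from the question would become td.InnerHtml = HtmlTextFixer.EscapeStrayLessThan(this.DRGuideNote.ToString());. Note that this only catches cases like "1 < 5" or "1 <5". A < immediately followed by a letter (for example "a<b") still looks like a tag, both to this pattern and to an HTML5 parser, so data like that would need to be encoded at the source.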

An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.

Upvotes: 1
