Reputation: 40
I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.
This is a C# app in VS2010. Running in Debug, I see the string in my class begins:
Animal and vegetable oils 1 < 5 MW <br>5-50 MW 30 <br>
But when I assign with:
td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();
The td.InnerHtml shows
Animal and vegetable oils 1 < 5=\"\" mw=\"\"><br>5-50 MW 30 <br>
Why is it putting the equals and escaped quotes into that text??? It doesn't do it across all the data, just a few files. Any ideas? (PS. There are html breaks in the strings not showing up, how do I post so it ignores html? Tried the "indent with 4 spaces but didn't seem to work?)
Upvotes: 0
Views: 486
Reputation: 24344
HTML Agility Pack's HTML parser is treating the <
as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag, and so it is treating them as tag attributes. This treatment stops once it runs into the <br>
which forces it to close the tag.
The reason it works in browsers is because browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a carat followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case. So I wouldn't say this is a bug, so much as a limitation of HAP's native HTML parser.
An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.
Upvotes: 1