Reputation: 2518
I'm currently coding a desktop application in c# which also has to handle XHTML document manipulation. For that purpose I'm using the Html Agility Pack which seemed to be okay so far. After carefully checking the output from Html Agility Pack I found out that the code isn't well formed xhtml any more.
It removes self-closing tags (slash) and overwrites other proprietary code elements...
eg. input html code:
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)" />
eg. output html code
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)">
(removed the trailing slash...)
Another example is with proprietary code elements (for Mikrotik hotspot devices):
eg input html code
<form action="$(link-login-only)" method="post" name="login" $(if chap-id) onSubmit="return doLogin()"$(endif)>
The $(if chap-id)
, $(endif)
and $(link-login-only)
parts are custom code fragments interpreted from the Mikrotik device.
eg. output html code after Html Agility Pack (which transforms it to unuseable code)
<form action="$(link-login-only)" method="post" name="login" $(if="" chap-id)="" onsubmit="return doLogin()" $(endif)="">
Has someone an idea how to "instruct" Html Agility Pack to output well formed XHTML and to ignore "custom code" fragments (is this possibly via Regex)?
Thanks in advance! :-)
Upvotes: 1
Views: 2355
Reputation: 82406
Necromancing.
Problem 1 is because you probably didn't specify OptionOutputAsXml = true
, meaning HtmlAgilityPack outputs HTML instead of XHTML.
Actually, doing this is rather clever, as it reduces the file size.
If you need XHTML, you need to specifically instruct HtmlAgilityPack to output XHTML (XML), not HTML (SGML).
SGML allows for tags with no closing tag (/>
), while XML does not.
Also, you don't have to wrap your script and style tags inside cdata in SGML (common JS/CSS code creates invalid XML, e.g. comparison/and operators).
To fix this:
public static void BeautifyHtml()
{
string input = "<html><body><p>This is some test test<br ><ul><li>item 1<li>item2<</ul></body>";
HtmlAgilityPack.HtmlDocument test = new HtmlAgilityPack.HtmlDocument();
test.LoadHtml(input);
test.OptionOutputAsXml = true;
test.OptionCheckSyntax = true;
test.OptionFixNestedTags = true;
System.Text.StringBuilder sb = new System.Text.StringBuilder();
using (System.IO.TextWriter stringWriter = new System.IO.StringWriter(sb))
{
test.Save(stringWriter);
}
string beautified = sb.ToString();
System.Console.WriteLine(beautified);
}
A word of caution:
HTMLAgilityPack doesn't handle encoding properly.
Meaning, if you save it to a stream, it will use the encoding-tags as-is if they are present in the HTML, instead of changing them to the ones used by the stream...
This can create a lot of (un)"happiness".
Just FYI. You have now been warned.
Note:
style/script must be wrapped inside a commented out cdata section for valid XML.
<style type="text/css">
/*<![CDATA[*/
div > span {
background-color: yellow;
}
/*]]>*/
</style>
<script type="text/javascript">
//<![CDATA[
if(1 < 2)
document.title = "Foo & Bar";
//]]>
</script>
Upvotes: 3
Reputation: 24344
An alternative is CsQuery which, at least for the simple cases you've got here, will leave your pre-processor tags alone by nature of just treating them like valueless attributes. That is, HAP appears to convert any attribute someattribute
without a value to someattribute=""
. CsQuery won't do this.
However the observations @Justin Niessner makes about your markup are going to be true for any parser that is not specifically designed to parse the templating code you have in there. Just because this one example makes it through CsQuery is no guarantee some other format won't result in something that's not a valid attribute name, or if not valid, at least acceptable to an HTML5 parser.
If you need to manipulate something as HTML, then do it after templating. If you need to manipulate it before the templating engine has at it, then you're in a catch 22, since it's not HTML yet. Or alternatively you could use a templating system that uses valid HTML markup for its keywords (example: Knockout).
Upvotes: 0
Reputation: 245509
In your first example, HTML Agility Pack is actually fixing your markup. The input element is a void element. Since there is no context inside, it needs no closing tag.
HTML Agility Pack is made for parsing valid HTML markup, not markup embedded with custom code. In your first example, the custom markup is inside quotes therefore isn't an issue. In your second example, the variables are outside quotes.
HTML Agility Pack tries to parse them as regular (but malformed) attributes of the element. There's no way to fix that. You'll have to find another way to parse your markup if you need support for custom code inside the markup.
Upvotes: 3