Capn Jack

Reputation: 1241

HTML Agility Pack is changing </p> tag to <p> on invalid markup

With an input of:

<head><title>Title</title></head>
<font face="Verdana" size="2">
<p>

<b>Bold sentence.</b>
<br><br>Sentence after two  breaks.<br><br>Sentence after another two  breaks. <b><i>bold and italicized sentence.</i></b> sentence. <br><br>final sentence after two more breaks.

</font></p>

<form><center><div style='padding-left: 16px; padding-right: 16px;'><a class='button' href='javascript:void(0);' onclick='javascript:window.close()'><img src='/GBUIAssets/Web20/img/frame/buttonshade.png' alt='buttonShade' /><span class='roundLeft'><span class='roundRight'>Fermer</span></span></a></div></center></form></font>

I'm removing the head, font, and form elements, and the output I get is:

<p>

<b>Bold sentence.</b>
<br><br>Sentence after two  breaks.<br><br>Sentence after another two  breaks. <b><i>bold and italicized sentence.</i></b> sentence. <br><br>final sentence after two more breaks.

<p>

This is problematic because I'm trying to convert the result to XML afterward, and the unclosed <p> will throw an error. Why is it "fixing" a part of my code that's already valid? Any ideas what could be causing it? I can supply more code if needed, but I just want to make sure first that there's no obvious step I'm missing.

EDIT: For the sake of full context, I'm stripping the HTML down to its body content. The catch is that this HTML is HIDEOUS, really badly formed. I'm loading the result into XML so that the parser throws the specific errors in each HTML document, and I write those errors to a report for each file that failed to strip.
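
The "load it as XML and report the error" step described above can be sketched as follows. This is only an illustration in Python's standard library, not the .NET code the question is about; the helper name `xml_error` is hypothetical. It shows how an XML parser surfaces a well-formedness error for a stripped fragment such as the output above.

```python
import xml.etree.ElementTree as ET


def xml_error(fragment):
    """Return None if the fragment is well-formed XML, else the parser's message.

    Hypothetical helper for illustration; the message text comes from the
    underlying parser and includes the line and column of the failure.
    """
    try:
        ET.fromstring(fragment)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The stripped output from the question: <br> is never closed and the
# fragment ends with a second <p> instead of </p>.
stripped = "<p>\n<b>Bold sentence.</b>\n<br><br>Sentence after two breaks.\n<p>\n"

# A well-formed variant: self-closing <br/> and a matching </p>.
well_formed = "<p>\n<b>Bold sentence.</b>\n<br/><br/>Sentence after two breaks.\n</p>\n"

print(xml_error(stripped))
print(xml_error(well_formed))
```

Collecting the returned message per file gives the kind of error report described in the edit, assuming each stripped fragment has a single root element.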

Upvotes: 0

Views: 262

Answers (2)

Conan

Reputation: 2709

Update your markup to:

<head>
  <title>Title</title>
</head>
<font face="Verdana" size="2">
<p>

<b>Bold sentence.</b>
<br/><br/>Sentence after two  breaks.<br/><br/>Sentence after another two  breaks. <b><i>bold and italicized sentence.</i></b> sentence. <br/><br/>final sentence after two more breaks.

</p>

<form>
<center>
<div style='padding-left: 16px; padding-right: 16px;'>
<a class='button' href='javascript:void(0);' onclick='javascript:window.close()'>
<img src='/GBUIAssets/Web20/img/frame/buttonshade.png' alt='buttonShade' />
<span class='roundLeft'><span class='roundRight'>Fermer</span></span>
</a>
</div>
</center>
</form>
</font>

If possible, I'd recommend replacing the <font> tag with a rule in an external stylesheet, e.g.

body { font-family: Verdana; }

Upvotes: 0

InvincibleM

Reputation: 519

The markup is invalid: <font> is opened before <p> but closed before </p>, so the tags overlap. Try putting the font tag inside the p tag and you should be fine.
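
To see where the nesting goes wrong, here is a minimal sketch of a stack-based check. It uses Python's standard `html.parser` rather than HTML Agility Pack, and the names `NestingChecker`, `find_misnested`, and `VOID_TAGS` are made up for this example. It only reports closing tags that do not match the most recently opened element; it makes no attempt to repair anything.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, enough for this markup).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Records closing tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.problems = []   # (closing tag seen, tag that was expected)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.problems.append((tag, expected))


def find_misnested(html):
    """Return a list of (closing_tag, expected_tag) pairs for mismatched closes."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


print(find_misnested("<font><p><b>x</b></font></p>"))
print(find_misnested("<p><font><b>x</b></font></p>"))
```

Run against the markup in the question, a check like this flags the `</font>` that arrives while `<p>` is still the innermost open element, which is the overlap a lenient HTML parser then has to guess its way around.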

Upvotes: 0
