Reputation: 604
My original issue is that I am trying to serialize a string containing html tags to an XML element.
hello <a href="world.php">World</a>, this
is
a nice
test
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
However, I have 2 issues. The one relevant here is that the text contains only plain newlines (no <p> tags, no <br> tags).
Now I would like to replace the newlines with <p> or <br> tags. I had a look here and used the suggested solution:
string result = "<p>" + text
.Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
.Replace(Environment.NewLine, "<br />")
.Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";
However, this does not generate valid HTML in all cases. In the example above, it would create <br /> tags between the <li> tags and put <ul> tags inside <p> tags, neither of which is allowed.
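To make the failure concrete, here is a minimal sketch that runs the same replace chain on the sample text. The class and method names (`NaiveNewlineReplace`, `Apply`) are made up for illustration, the newline is passed in explicitly instead of using `Environment.NewLine` so the behaviour does not depend on the platform, and the blank line after "this" in the input is an assumption based on the target output.

```csharp
using System;

public static class NaiveNewlineReplace
{
    // Same replace chain as in the question, with the newline string
    // passed in explicitly (assumption: the input uses one newline style).
    public static string Apply(string text, string newline)
    {
        return "<p>" + text
            .Replace(newline + newline, "</p><p>")   // blank line => paragraph break
            .Replace(newline, "<br />")              // single newline => line break
            .Replace("</p><p>", "</p>" + newline + "<p>") + "</p>";
    }

    // Sample input from the question; the "\n\n" after "this" is assumed.
    public const string Sample =
        "hello <a href=\"world.php\">World</a>, this\n\n" +
        "is\na nice\ntest\n" +
        "<ul>\n<li>to demonstrate my issue</li>\n<li>and find a solution</li>\n</ul>";
}
```

Calling `NaiveNewlineReplace.Apply(NaiveNewlineReplace.Sample, "\n")` gives a string that contains `</li><br /><li>` and ends with `</ul></p>`, which are exactly the two problems described above.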
The target is a result like the following (line breaks are only for readability and don't matter here):
<p>hello <a href="world.php">World</a>, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
Do you have any suggestions for solving this with string.Replace, a Regex, or a better approach (HtmlDocument)?
Please note: I have no influence on deserialization. The XML output is evaluated by a tool I have no influence on, and it has to be UTF-8 encoded.
Thank you!
EDIT: Clearly separated the 2 issues
EDIT2: No influence on deserialization
EDIT3: Added target output
Upvotes: 1
Views: 1715
Reputation: 435
I've had to do something similar (ensuring 3rd party content has valid HTML). If I were doing this, I'd do the following:
1) Replace line breaks with HTML line breaks
string result = text.Replace(Environment.NewLine, "<br />");
2) Use the Html Agility Pack to fix any invalid HTML
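using System.Linq;        // needed for Any() on ParseErrors
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Treat <p> as an element that needs an explicit closing tag
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
doc.OptionFixNestedTags = false;
doc.LoadHtml(result);
if (doc.ParseErrors.Any())
{
    // throw error
}
else
{
    // get fixed html
    result = doc.DocumentNode.OuterHtml;
}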
Upvotes: 0
Reputation: 151594
What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms it into a valid DOM that an HTML parser can handle.
You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.
Or you can just encode the input HTML in such a way that it doesn't interfere with the XML you're putting it in: wrapping it in a CDATA section or base64-encoding it would both suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
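As a rough sketch of those two encoding options, using only `System.Xml.Linq` and `System.Convert` from the base class library: the class `HtmlInXml`, its method names and the element name passed in are hypothetical, chosen just for this example. Note that base64 only helps if the consumer decodes it again, and that a CDATA section cannot itself contain the sequence `]]>`, so input containing it needs extra care.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class HtmlInXml
{
    // Option 1: wrap the raw HTML in a CDATA section so the XML parser
    // does not try to interpret the HTML markup.
    public static XElement AsCData(string elementName, string html)
    {
        return new XElement(elementName, new XCData(html));
    }

    // Option 2: store the HTML as base64 of its UTF-8 bytes.
    // The consumer has to decode it again (see FromBase64).
    public static XElement AsBase64(string elementName, string html)
    {
        string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(html));
        return new XElement(elementName, encoded);
    }

    // Reverses AsBase64: base64 text -> UTF-8 bytes -> original string.
    public static string FromBase64(XElement element)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
    }
}
```

For example, `HtmlInXml.AsCData("content", html).ToString()` yields an element whose body is a `<![CDATA[...]]>` block holding the HTML unchanged, while `HtmlInXml.FromBase64(HtmlInXml.AsBase64("content", html))` returns the original string.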
Upvotes: 3