nogenius

Reputation: 604

Using C# to convert incorrect html string to real html

My original issue is that I am trying to serialize a string containing html tags to an XML element.

hello <a href="world.php">World</a>, this

is
a nice
test

<ul>
  <li>to demonstrate my issue</li>
  <li>and find a solution</li>
</ul>

However, I have two issues:

  1. Serializing HTML to XML: I did not succeed in defining the Serializable class so that it serializes correctly with XmlSerializer, so I decided that using CDATA sections might be the better way. However, the target tool (which I have no influence on) does not deserialize this correctly. What I need is plain, correct HTML (XHTML?) within the XML output file.

  2. The string looks like the example above, but it is not fully correct HTML (no <p> tags, no <br> tags). I would like to replace the newlines with <p> or <br> tags. I have had a look here and used the suggested solution:

    string result = "<p>" + text
     .Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
     .Replace(Environment.NewLine, "<br />")
     .Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";

However, this does not generate valid HTML in all cases. In the example above, it would insert <br /> tags between the <li> tags and place the <ul> inside a <p>, neither of which is allowed.

The target is a result like the following (line breaks are only for readability and don't matter here):

<p>hello <a href="world.php">World</a>, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
  <li>to demonstrate my issue</li>
  <li>and find a solution</li>
</ul>

Do you have any suggestion how to solve this either with a string.Replace, Regex, or better solution (HtmlDocument)?

Please note: I have no influence on deserialization. The XML output is evaluated by a tool I cannot change, and it has to be UTF-8 encoded.

Thank you!

EDIT: Clearly separated the 2 issues

EDIT2: No influence on deserialization

EDIT3: Added target output

Upvotes: 1

Views: 1715

Answers (2)

Jack

Reputation: 435

I've had to do similar (ensuring 3rd party content has valid HTML). If I was doing this, I'd do the following:

1) Replace line breaks with HTML line breaks

    string result = text.Replace(Environment.NewLine, "<br />");

2) Use the HTML Agility Pack to fix any invalid HTML

    using System.Linq;
    using HtmlAgilityPack;

    var doc = new HtmlDocument();
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
    doc.OptionFixNestedTags = false;
    doc.LoadHtml(result);

    if (doc.ParseErrors.Any())
    {
        // throw error
    }
    else
    {
        // get fixed html
        result = doc.DocumentNode.OuterHtml;
    }

Upvotes: 0

CodeCaster

Reputation: 151594

What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms it into a valid DOM that an HTML parser can handle.

You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.


Or you can just encode the input HTML in such a way that it doesn't interfere with the XML that you're trying to put it in, for example by wrapping it in a CDATA section; base64-encoding the input would also suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
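
As a rough illustration of those two encodings, here is a minimal sketch using LINQ to XML. The class name `HtmlEmbedding`, its method names, and the element name passed in are made up for this example and are not part of the question. Note that the question reports the target tool did not handle CDATA correctly, so this only shows the producing side; whether the consumer accepts either form has to be checked separately.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, named only for this sketch.
public static class HtmlEmbedding
{
    // Wraps the raw HTML in a CDATA section, so the XML parser
    // treats the markup as opaque character data.
    public static XElement AsCData(string elementName, string html)
    {
        return new XElement(elementName, new XCData(html));
    }

    // Stores the HTML as base64 of its UTF-8 bytes.
    // The consumer must decode it again before use.
    public static XElement AsBase64(string elementName, string html)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(html);
        return new XElement(elementName, Convert.ToBase64String(bytes));
    }

    // Reverses AsBase64, mainly to show the round trip.
    public static string FromBase64(XElement element)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
    }
}
```

Usage would be along the lines of `HtmlEmbedding.AsCData("content", text)`, with the resulting element added to the document before saving it as UTF-8. Both variants leave the HTML itself untouched, so they avoid the tag-soup problem but do not make the content valid XHTML.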

Upvotes: 3
