Reputation: 229
Regular expression to match ">", "<", "&" chars that appear inside XML nodes
I have an almost identical problem to this; however, I am using C#.
I'm not here to argue the validity of the XML.
What gets sent in is out of my control.
Input XML:
<PNODE>
<CNODE>This string contains > and < and & chars.</CNODE>
</PNODE>
I need it to look like this:
<PNODE>
<CNODE>This string contains &gt; and &lt; and &amp; chars.</CNODE>
</PNODE>
It looks like the guy found a solution for PHP, which doesn't help me.
I need to find a way to escape the &, > and < characters inside the node, but leave the tag declarations alone.
Upvotes: 2
Views: 9937
Reputation: 28004
"I'm not here to argue the validity of the XML."
As with that other question, the right answer is that what you got sent is not XML. It's a question of well-formedness, not a question of validity in the XML sense.
"What gets sent in is out of my control."
That may be true, but if someone sent you a quart of used motor oil and asked you to transform it into HTML, would you still accept it? Usually data interchange is done based on a contract (formal or informal), that the interchanged data will adhere to certain criteria. If it doesn't live up to the agreed-upon criteria, the data can be sent back, rejected.
If you're not requiring XML as input, this question is not about "<, & chars that appear inside XML nodes". Rather, it's about parsing SGML that looks a lot like XML, but which has < and & chars that appear in text content.
And to do that, .NET Tidy and SGMLReader are good solutions, as others have said.
Upvotes: 0
Reputation: 1683
I've always just used replace for XML (saves me having to bring in HTTP libraries):
string output = inputXml.Replace("&", "&amp;") // must run first, or the entities added below get escaped again
.Replace("<", "&lt;")
.Replace(">", "&gt;")
.Replace("'", "&apos;") // optional
.Replace("\"", "&quot;"); // optional
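Note that a blanket Replace over the whole document also escapes the angle brackets of the tags themselves, which is not what the question asks for. Below is a minimal sketch of one way to escape only the text between tags, using Regex.Replace with a MatchEvaluator. The names LooseXmlEscaper and EscapeTextContent are hypothetical, and the sketch assumes the tags carry no attributes and that the text never contains something shaped like a tag (for example "x<y>z"). It is a heuristic, not a parser.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LooseXmlEscaper
{
    // Alternation order matters: a tag-shaped token is tried first, so its
    // angle brackets are consumed as a unit and left untouched.
    private static readonly Regex Token = new Regex(
        @"(?<tag></?[A-Za-z_][\w.:-]*\s*/?>)" +                    // <NAME>, </NAME>, <NAME/> (no attributes: assumption)
        @"|&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)" +    // a bare & that does not already start an entity
        @"|<|>");                                                  // any remaining stray angle bracket

    public static string EscapeTextContent(string input)
    {
        return Token.Replace(input, m =>
        {
            // Tag-shaped matches pass through unchanged.
            if (m.Groups["tag"].Success) return m.Value;

            // Everything else matched is a single character to escape.
            switch (m.Value)
            {
                case "&": return "&amp;";
                case "<": return "&lt;";
                default:  return "&gt;";
            }
        });
    }
}
```

Calling LooseXmlEscaper.EscapeTextContent on the input from the question should leave the PNODE and CNODE tags alone and escape only the three characters in the text. Entities that are already escaped, such as &amp;, are skipped by the lookahead so they are not escaped twice.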
Upvotes: 0
Reputation: 17792
You should have a look at SgmlReader:
http://developer.mindtouch.com/SgmlReader
It will give you exactly what you want :) I use it here: http://www.xmltools.dk/HtmlToXml (try it :) You can disable the html tag and the conversion of uppercase tags to lowercase.)
Upvotes: 0
Reputation: 21088
There are a couple of .NET wrappers around the Tidy library:
http://users.rcn.com/creitzel/tidy.html#dotnet
http://www.codeproject.com/KB/mcpp/eftidynet.aspx
And there is a .NET port of Tidy.
Upvotes: 0