Dan Hastings

Reputation: 3280

C# Escape illegal xml characters from node text only

I am working with an API and for some crazy reason the XML being returned has & characters that are not correctly escaped. This has left me in an annoying position: I get an exception when I try to use an XmlDocument to parse the XML string.

I can use replace to get rid of the characters, but this could lead to issues.

xml = xml.Replace("&amp;", "&").Replace("&", "&amp;");

The problem with this is that some values may already be escaped. A node like this will break the line of code above.

<node>Something & something &lt; annoying</node>

If I replace the & characters with &amp;, it will break &lt; (it becomes &amp;lt;). I can't use the same approach for &lt; as I did for &amp;, because that would convert all of the <> brackets, not just the ones in node text that need escaping.
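To make the problem concrete, here is a minimal sketch of what the replace-based approach does to that node. It assumes the chain is Replace("&amp;", "&") followed by Replace("&", "&amp;"); NaiveEscape is a name made up for this illustration, and the snippet is written as top-level statements (C# 9 or later).

```csharp
using System;

// NaiveEscape is a hypothetical helper name used only for this illustration.
static string NaiveEscape(string xml) =>
    xml.Replace("&amp;", "&")   // undo ampersands that are already escaped
       .Replace("&", "&amp;");  // then escape every ampersand, including the one that starts &lt;

var input = "<node>Something & something &lt; annoying</node>";
Console.WriteLine(NaiveEscape(input));
// <node>Something &amp; something &amp;lt; annoying</node>
```

The bare ampersand is fixed, but the existing &lt; is escaped a second time, so the text would read back as the literal characters "&lt;" instead of "<".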

Here is a node that is giving trouble.

<CompanyName>Fire & Ice</CompanyName>
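A minimal sketch that reproduces the failure for this node, assuming the response is loaded with XmlDocument.LoadXml. CanLoad is a hypothetical helper name, and the snippet is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Xml;

// CanLoad is a hypothetical helper: true if the string parses, false if the parser rejects it.
static bool CanLoad(string xml)
{
    try
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return true;
    }
    catch (XmlException ex)
    {
        // Print the parser's complaint so the offending position is visible.
        Console.WriteLine(ex.Message);
        return false;
    }
}

Console.WriteLine(CanLoad("<CompanyName>Fire & Ice</CompanyName>"));     // False
Console.WriteLine(CanLoad("<CompanyName>Fire &amp; Ice</CompanyName>")); // True
```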

Upvotes: 2

Views: 655

Answers (2)

Mehmet

Reputation: 755

I recommend XElement. It is a useful object, and XElement.Value will return the string that you want.

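using System;
using System.Xml.Linq;

// Build the elements from plain strings; XElement escapes the text when it is serialized.
XElement y = new XElement("CompanyNames",
                new XElement("CompanyName", "Fire & Ice")
                );
foreach (var item in y.Elements("CompanyName"))
{
   Console.WriteLine(item.Value);
}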

Output will be "Fire & Ice"
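As a hedged illustration of where this helps: XElement takes care of escaping when an element is built from a plain string, but parsing a string that already contains a bare & still fails. The sketch below shows both cases; CanParse is a hypothetical helper name, and the snippet is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Building an element from a plain string: the text is escaped on serialization.
var built = new XElement("CompanyName", "Fire & Ice");
Console.WriteLine(built.Value);      // Fire & Ice
Console.WriteLine(built.ToString()); // <CompanyName>Fire &amp; Ice</CompanyName>

// CanParse is a hypothetical helper: true if XElement.Parse accepts the string.
static bool CanParse(string xml)
{
    try { XElement.Parse(xml); return true; }
    catch (XmlException) { return false; }
}

// Parsing the raw, malformed text is a different matter: the bare & is rejected.
Console.WriteLine(CanParse("<CompanyName>Fire & Ice</CompanyName>")); // False
```

So this approach fits when you control how the XML is produced; a response that is already malformed still has to be repaired before it can be parsed.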

Upvotes: -1

Charles Mager

Reputation: 26213

You can use a regex similar to the one in this related question. This essentially matches all unescaped ampersands (i.e. it will match &, but not &something;).

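using System.Text.RegularExpressions;

var xml = @"<node>Something & something &lt; annoying</node>";

// Escape every & that is not followed by word characters and a semicolon.
var result = Regex.Replace(xml, @"&(?!\w*;)", "&amp;");

// output: <node>Something &amp; something &lt; annoying</node>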
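One caveat with the pattern above: # is not a word character, so a numeric reference such as &#38; is treated as unescaped and turned into &amp;#38;. If the API might return those, a stricter lookahead could be used. The following is only a sketch under that assumption; EscapeBareAmpersands is a hypothetical helper name, the sample input is made up, and the snippet is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// EscapeBareAmpersands is a hypothetical helper name.
// The lookahead skips an & that already starts one of the five predefined XML entities
// or a numeric character reference such as &#38; or &#x26;. Everything else is escaped.
static string EscapeBareAmpersands(string xml) =>
    Regex.Replace(xml, @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;");

// Made-up sample mixing a bare &, a numeric reference, and named entities.
var raw = "<CompanyName>Fire & Ice &#38; Co &lt;Ltd&gt;</CompanyName>";
var fixedXml = EscapeBareAmpersands(raw);

// The repaired string now loads without an exception.
var doc = new XmlDocument();
doc.LoadXml(fixedXml);
Console.WriteLine(doc.DocumentElement?.InnerText); // Fire & Ice & Co <Ltd>
```

Listing the predefined entities explicitly also means something like &nbsp;, which is not defined in plain XML, gets escaped to &amp;nbsp; instead of being left for the parser to reject. Whether that is the right trade-off depends on what the API actually sends.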

Upvotes: 4
