mitchellt

Reputation: 1022

Parsing XML which contains illegal characters

A message I receive from a server contains tags, and the data I need is inside those tags.

When I try to parse the payload as XML, illegal-character exceptions are thrown.

I also tried using HttpUtility and Security Utility to escape the illegal characters. The only problem is that this also escapes < and >, which are needed to parse the XML.

My question is: how do I parse XML when the data inside it contains characters that are illegal in XML (for example, a bare & that should be &amp;)?

Thanks.

Example:

<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>

Upvotes: 4

Views: 4032

Answers (3)

Michael Kay

Reputation: 163587

Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.

When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.

If you can't do that, you need to find some way of using non-XML tools (e.g. custom Perl scripts) to repair the XML before you let it anywhere near an XML parser. How you do this will depend on the nature of the errors you need to repair.

Upvotes: 2

Selman Genç

Reputation: 101701

Here is a more general solution than a regex. First declare an array containing each invalid character that you want to replace with its encoded form:

var invalidChars = new[] { '&' /* , other chars go here */ };

Then read the whole XML file as text:

var xmlContent = File.ReadAllText("path");

Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:

// Requires: using System.Linq; using System.Web;
var validContent = string.Concat(xmlContent
        .Select(c => invalidChars.Contains(c)
            ? HttpUtility.HtmlEncode(c.ToString()) // e.g. '&' becomes "&amp;"
            : c.ToString()));

Then parse the result with XDocument.Parse. That's all.
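The steps above can be put together roughly as follows. This is only a sketch: the class and method names (CharEscapeSketch, EncodeInvalidChars) are made up for illustration, the input is a shortened version of the sample from the question held in a string rather than read from a file, and only '&' is listed as invalid.

```csharp
using System;
using System.Linq;
using System.Web;
using System.Xml.Linq;

public static class CharEscapeSketch
{
    // Only '&' is listed; which other characters belong here is left open.
    private static readonly char[] InvalidChars = { '&' };

    // Hypothetical helper: encodes every listed character, leaves the rest as-is.
    public static string EncodeInvalidChars(string xmlContent)
    {
        return string.Concat(xmlContent.Select(c =>
            InvalidChars.Contains(c)
                ? HttpUtility.HtmlEncode(c.ToString())
                : c.ToString()));
    }

    public static void Main()
    {
        string raw = "<item><title>voi hoody & polo shirt</title></item>";
        XDocument doc = XDocument.Parse(EncodeInvalidChars(raw));
        Console.WriteLine(doc.Root.Element("title").Value);
    }
}
```

Because every '&' is encoded, input that already contains an entity such as &amp; comes out as &amp;amp;. That still parses, but the element value then holds the literal text "&amp;", so this approach fits best when the input is known to contain no escaped entities.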

Upvotes: 1

Ulugbek Umirov

Reputation: 12807

If & is the only invalid character, you can use a regex to replace it with &amp;. The regex uses a negative lookahead so that entities already present, such as &amp;, &quot; and &#111;, are not escaped a second time.

The regex can be as follows:

&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)


Sample code:

string content = @"<item><code>1234 &amp; test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
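To see what the replacement does, the same regex can be wrapped in a small helper and the parsed values inspected. This is a sketch, not part of the sample above: XmlRepair and EscapeBareAmpersands are hypothetical names, and the input is a shortened version of the sample string.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlRepair
{
    // Matches '&' unless it starts a predefined entity or a numeric character reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: escapes only the ampersands that are not already part of an entity.
    public static string EscapeBareAmpersands(string content)
    {
        return BareAmpersand.Replace(content, "&amp;");
    }

    public static void Main()
    {
        string raw = "<item><code>1234 &amp; test</code><title>hoody & polo</title></item>";
        XElement item = XElement.Parse(EscapeBareAmpersands(raw));
        Console.WriteLine(item.Element("code").Value);  // 1234 & test
        Console.WriteLine(item.Element("title").Value); // hoody & polo
    }
}
```

The existing &amp; in the code element is left alone, while the bare & in the title is escaped, so both values read back with a single ampersand. One thing to keep in mind: with RegexOptions.IgnoreCase an upper-case form such as &AMP; is also skipped, but XML entity names are case-sensitive, so that input would still fail to parse.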

Upvotes: 6
