Reputation: 1022
A message I receive from a server contains tags and in the tags is the data I need.
I try to parse the payload as XML but illegal character exceptions are generated.
I also made use of httpUtility
and Security Utility
to escape the illegal characters, only problem is, it will escape < >
which is needed to parse the XML.
My question is, how do I parse XML when the data contained in it contains illegal non XML characters? (& -> amp;)
_
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>
Upvotes: 4
Views: 4032
Reputation: 163587
Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
Upvotes: 2
Reputation: 101701
Here is more generalized solution than Regex
. First declare an array, store each invalid character that you want to replace with encoded version into it:
var invalidChars = new [] { '&', other chars comes here.. };
Then read all the xml as a whole text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid chars using LINQ
and HttpUtility.HtmlEncode
:
var validContent = string.Concat(xmlContent
.Select(x =>
{
if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x);
return x.ToString();
}));
Then parse it using XDocument.Parse
, that's all.
Upvotes: 1
Reputation: 12807
If you have only &
as invalid character, then you can use regex to replace it with &
. We use regex to prevent replacement of already existing &
, "
, o
, etc. symbols.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = @"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
Upvotes: 6