Andy Evans

Reputation: 7176

Filter certain unicode characters out of XML

... specifically 0xA3, the pound sign (&pound;, &#xa3;, &#163;)

I'm loading several long XML documents, and occasionally I run into one that won't load. It throws this exception:

Invalid character in the given encoding. Line x, position y.

Here's the code in question:

var doc = new XmlDocument();
doc.Load(file.FullName);

When I open the document and go to the line indicated, I see the 0xA3 displayed in inverse video (black background, white foreground) inside one of the XML elements.

The header of each XML file is nothing remarkable:

<?xml version="1.0" encoding="UTF-8"?> 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

This may sound like a really dumb question, but is there a way either to remove the offending character or to tell the XmlDocument that reads the file to accept the character encoding?

Upvotes: 0

Views: 602

Answers (1)

Jenszcz

Reputation: 547

This answer is based on the assumption that your XML file does not contain the character entity &#xa3; but the byte value 0xa3.

In UTF-8, the pound sign is encoded as the two-byte sequence 0xC2 0xA3. If the byte 0xA3 stands alone, with no 0xC2 before it, your XML file is not actually UTF-8 encoded, and the encoding declared in the header is wrong.
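
To check whether a given file really is valid UTF-8, one option is to decode its bytes with a strict decoder that throws on malformed sequences instead of silently substituting a replacement character. The following is a minimal sketch, not a definitive implementation; the class and method names (`Utf8Check`, `IsValidUtf8`) are made up for illustration.

```csharp
using System.Text;

static class Utf8Check
{
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // A lone 0xA3 (not part of a multi-byte sequence) makes this return false.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true turns malformed input into an exception.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Calling `Utf8Check.IsValidUtf8(System.IO.File.ReadAllBytes(file.FullName))` before `doc.Load` would tell you which files carry a wrong encoding declaration.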

If this is the case, you can either change the encoding in the XML header to ISO-8859-1 (where the pound sign is at code point 0xA3), or find out why your XML files are not UTF-8 encoded and fix them at the source. Since I don't know whether your files contain characters that do not exist in ISO-8859-1, I would prefer the second option.
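
If editing the header by hand is not practical, a possible workaround is to decode the file yourself and hand the resulting text to XmlDocument. This sketch assumes the affected files really are ISO-8859-1; the helper names (`Latin1Xml`, `LoadAsLatin1`, `ConvertToUtf8`) are hypothetical. As far as I know, when XmlDocument loads from a TextReader it relies on the reader's decoding and does not re-apply the encoding named in the XML declaration.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class Latin1Xml
{
    // Assumption: the file is ISO-8859-1 despite declaring UTF-8.
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Decode the file as ISO-8859-1 and load the decoded text.
    public static XmlDocument LoadAsLatin1(string path)
    {
        var doc = new XmlDocument();
        using (var reader = new StreamReader(path, Latin1))
        {
            doc.Load(reader);
        }
        return doc;
    }

    // Rewrite the file as real UTF-8 (without a BOM) so that the
    // existing encoding="UTF-8" declaration becomes correct.
    public static void ConvertToUtf8(string path)
    {
        string text = File.ReadAllText(path, Latin1);
        File.WriteAllText(path, text, new UTF8Encoding(false));
    }
}
```

Neither helper should be applied to a file that is already valid UTF-8, since decoding UTF-8 bytes as ISO-8859-1 would garble every non-ASCII character.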

Upvotes: 2
