Reputation: 1563
I am trying to parse an XML file with an xml.sax.handler.ContentHandler
subclass. The parser fails at the following line:
<desc>some_text</desc>
and I get the following error:
xml.sax._exceptions.SAXParseException: test.xml:687338:17: reference to invalid character number
The spec (http://www.w3.org/TR/xml/#sec-references) says that the characters at code points 15 and 18, which are referenced in that line, are valid. So is there a bug in the parser, or am I doing something wrong?
Upvotes: 0
Views: 1412
Reputation: 489638
Although you can write references to these characters, they're still at best "frowned upon". See http://www.w3.org/TR/xml/#NT-Char for the characters XML 1.0 allows; anything outside that production is a "bad" character. Then see the XML 1.1 spec as well, which adds some of them back, allowed in some cases, as "restricted" characters.
If the text legitimately should be able to include these characters, it's wise to encode it first, e.g. with base64. The receiver then gets well-formed XML (XML 1.1 does not always require this, but encoding keeps the document compatible with 1.0).
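As a rough illustration of the base64 suggestion, here is a minimal sketch. The `encoding="base64"` attribute, the `wrap_payload` helper and the `DescHandler` class are made up for this example; sender and receiver would have to agree on some such convention themselves.

```python
import base64
import xml.sax
from xml.sax.handler import ContentHandler


def wrap_payload(text: str) -> bytes:
    """Sender side: base64-encode the text so no control characters reach the XML."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # The encoding="base64" attribute is a hypothetical marker, not a standard.
    return f'<desc encoding="base64">{encoded}</desc>'.encode("ascii")


class DescHandler(ContentHandler):
    """Receiver side: collect character data, decode it on request."""

    def __init__(self):
        super().__init__()
        self._chunks = []

    def characters(self, content):
        # characters() may be called several times for one element.
        self._chunks.append(content)

    def text(self) -> str:
        return base64.b64decode("".join(self._chunks)).decode("utf-8")


raw = "some\x0ftext\x12"  # contains code points 15 and 18
handler = DescHandler()
xml.sax.parseString(wrap_payload(raw), handler)
decoded = handler.text()
```

The base64 alphabet only uses letters, digits, `+`, `/` and `=`, all of which are ordinary XML 1.0 characters, so the document parses regardless of what the original text contained.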
I had to deal with externally-supplied invalid XML myself once before, where I had no control over the sender. It's pretty messy. In my case I could rely on certain patterns, and hence use regular expressions to "patch away" improprieties, but this is a hack: a workaround done out of desperation, instead of a proper fix.
(In my case I had to handle things that would have tripped up even an XML 1.1 parser. The sender was just plain broken: a bunch of Perl code using faulty regexps and some literal <foo>-type strings to generate pretend-XML. So I never looked any further.)
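For what it's worth, the "patch away" approach could look something like the sketch below. It assumes the offending characters arrive as numeric character references (such as `&#15;`) rather than as raw bytes, and `scrub_char_refs` is a name invented here. It is the same kind of hack described above: it will also rewrite text inside comments and CDATA sections, where the reference is really just literal text.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Matches decimal (&#15;) and hexadecimal (&#xF;) character references.
_CHAR_REF = re.compile(rb"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _allowed_in_xml10(cp: int) -> bool:
    """Char production of XML 1.0: tab, LF, CR and the listed ranges."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def scrub_char_refs(data: bytes, replacement: bytes = b"") -> bytes:
    """Drop (or replace) references to characters XML 1.0 does not allow."""

    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _allowed_in_xml10(cp) else replacement

    return _CHAR_REF.sub(fix, data)


broken = b"<desc>a&#15;b&#x12;c&#65;</desc>"
patched = scrub_char_refs(broken)
# The patched document now parses; the valid reference &#65; is left alone.
xml.sax.parseString(patched, ContentHandler())
```

Whether silently dropping the characters is acceptable depends on the data; passing something like `replacement=b"?"` at least leaves a visible trace.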
Upvotes: 1
Reputation: 163585
The characters at Unicode codepoints 15 and 18 are allowed in XML 1.1 but not in XML 1.0.
It looks like your parser doesn't support XML 1.1 (many don't).
You either need to get an XML 1.1 parser (and ensure that the XML declaration says version="1.1"), or you need to fix the process that is producing the ill-formed XML.
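A quick way to check how the parser treats such references is a reproduction along these lines. It is only a sketch: the sample documents are invented, and it assumes the failing line contained numeric references like `&#15;` and `&#18;`.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def try_parse(document: bytes):
    """Return the SAXParseException raised while parsing, or None on success."""
    try:
        xml.sax.parseString(document, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


# References to code points 15 and 18: outside the XML 1.0 Char production.
bad = try_parse(b"<desc>some&#15;text&#18;</desc>")
# References to tab and line feed: allowed in XML 1.0.
good = try_parse(b"<desc>some&#9;text&#10;</desc>")

print("control-character references:", bad)
print("tab/line-feed references:", good)
```

If the first call reports an error and the second does not, the parser is applying the XML 1.0 character rules and the input has to change.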
Upvotes: 1