Reputation: 187339
I'm trying to parse this XML document with a SAX parser:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE WIN_TPBOUND_MESSAGES SYSTEM "tpbound_messages_v1.dtd">
<WIN_TPBOUND_MESSAGES>
<SMSTOTP>
<SOURCE_ADDR>+447522579247</SOURCE_ADDR>
<TEXT>TEST: @£$¥èéùìò?ØøÅå& ^{}\\[~]¡€ÆæßÉ!\"#¤%'()*+,-./0123456789:;<=>? ÄÖÑܧ¿äöñüà end</TEXT>
<WINTRANSACTIONID>652193268</WINTRANSACTIONID>
</SMSTOTP>
</WIN_TPBOUND_MESSAGES>
After parsing the <TEXT> element, the content is converted to:
TEST: @£$¥èéùìò?Ã�øÃ�Ã¥& ^{}\\[~]¡€Ã�æÃ�Ã�!\"#¤%'()*+,-./0123456789:;<=>? Ã�Ã�Ã�Ã�§¿äöñüà end
So clearly something bad is happening to the non-ASCII characters. The code that parses the XML is shown below:
public void parse(InputStream xmlStream) throws WinGatewayException {
    XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(this);
    parser.setErrorHandler(error);
    parser.setEntityResolver(new DTDResolver());
    parser.setDTDHandler(this);
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema", true);
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", true);
    parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    parser.setFeature("http://apache.org/xml/features/continue-after-fatal-error", false);
    parser.parse(new InputSource(xmlStream));
}
and the object referred to by this has methods such as:
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equals("TEXT")) {
        logger.debug("Parsed message text: " + cData.toString());
        message.setText(cData.toString());
    }
}
Why aren't these non-ASCII characters being preserved by the XML parser?
Upvotes: 1
Views: 5031
Reputation: 1501626
I believe your XML file is actually in UTF-8 rather than ISO-8859-1.
An ISO-8859-1-encoded file would have a single byte per character, so the UK pound sign would be a single byte 0xA3. However, it looks like your file has 0xC2 0xA3, which is the byte sequence you'd get for U+00A3 in UTF-8.
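To make the byte-level reasoning concrete, here is a minimal sketch using only the standard Java library. The class and helper names (PoundSignBytes, hex, codePoints) are made up for illustration and are not part of your code. It encodes U+00A3 both ways, then decodes the UTF-8 bytes as ISO-8859-1 to show how one character turns into two:

```java
import java.nio.charset.StandardCharsets;

public class PoundSignBytes {
    // Formats bytes as space-separated hex, e.g. "C2 A3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats each char as U+XXXX so the result does not depend on console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the UK pound sign
        byte[] latin1 = pound.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = pound.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(latin1)); // A3
        System.out.println(hex(utf8));   // C2 A3

        // Decoding the UTF-8 bytes as ISO-8859-1 yields two characters instead of one.
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(codePoints(misread)); // U+00C2 U+00A3
    }
}
```

The extra U+00C2 is the "Â"-style junk that tends to show up in front of the intended character when UTF-8 data is read as ISO-8859-1.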
Change the XML declaration to reflect this:
<?xml version="1.0" encoding="UTF-8"?>
and see if that fixes things. Assuming it does, you then need to work out what produced this mislabelled data in the first place.
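If you want to check the effect of the declaration in isolation, something like the following sketch may help. It uses the JDK's default SAX parser rather than your Xerces setup, leaves out the DTD, and the names (DeclarationDemo, parseText, utf8Document) are hypothetical. The bytes are always UTF-8; only the declared encoding changes:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DeclarationDemo {
    // Parses the bytes with the JDK's default SAX parser and returns all character data.
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    // Builds a tiny document whose bytes are always UTF-8, whatever the declaration claims.
    static byte[] utf8Document(String declaredEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<TEXT>\u00A3\u00E9</TEXT>"; // pound sign and e-acute
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mislabelled = parseText(utf8Document("ISO-8859-1"));
        String correct = parseText(utf8Document("UTF-8"));
        System.out.println(mislabelled.length()); // 4: each character was split in two
        System.out.println(correct.length());     // 2: characters preserved
    }
}
```

With a byte stream and no other hint, the parser goes by the encoding named in the declaration, so the same bytes come out as two characters or four depending on that one attribute.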
Upvotes: 3