Parsing non-ASCII characters in an XML document

Question

I'm trying to parse this XML document with a SAX parser:




    
        +447522579247
        TEST: @£$¥èéùìò?ØøÅå& ^{}\[~]¡€ÆæßÉ!"#¤%'()*+,-./0123456789:;<=>? ÄÖÑÜ§¿äöñüà end
        652193268

After parsing the element, the content is converted to:

TEST: @Â£$Â¥Ã¨Ã©Ã¹Ã¬Ã²?Ã�Ã¸Ã�Ã¥& ^{}\[~]Â¡€Ã�Ã¦Ã�Ã�!"#Â¤%'()*+,-./0123456789:;<=>? Ã�Ã�Ã�Ã�Â§Â¿Ã¤Ã¶Ã±Ã¼Ã  end

So clearly something bad is happening to the non-ASCII characters. The code that parses the XML is shown below:

public void parse(InputStream xmlStream) throws WinGatewayException {
    XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(this);
    parser.setErrorHandler(error);
    parser.setEntityResolver(new DTDResolver());
    parser.setDTDHandler(this);
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema", true);
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", true);
    parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    parser.setFeature("http://apache.org/xml/features/continue-after-fatal-error", false);
    parser.parse(new InputSource(xmlStream));
}

The object referred to by "this" (the content handler) has methods such as:

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equals("TEXT")) {
        logger.debug("Parsed message text: " + cData.toString());
        message.setText(cData.toString());
    }
}

Why aren't these non-ASCII characters being preserved by the XML parser?

Jon Skeet · Accepted Answer

I believe your XML file is actually in UTF-8 rather than ISO-8859-1.

An ISO-8859-1-encoded file would have a single byte per character, so the UK pound sign would be a single byte 0xA3. However, it looks like your file has 0xC2 0xA3, which is the byte sequence you'd get for U+00A3 in UTF-8.
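To make the byte-level reasoning concrete, here is a minimal sketch that reproduces the effect outside of any XML parser: it encodes a character as UTF-8 and then wrongly decodes those bytes as ISO-8859-1. The class and method names (MojibakeDemo, misdecode, hex) are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Render bytes as space-separated upper-case hex, e.g. "C2 A3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the UK pound sign, written as an escape
        System.out.println(hex(pound.getBytes(StandardCharsets.ISO_8859_1))); // A3
        System.out.println(hex(pound.getBytes(StandardCharsets.UTF_8)));      // C2 A3
        // The mis-decoded string has two characters: U+00C2 followed by U+00A3.
        System.out.println(misdecode(pound).length()); // 2
    }
}
```

The two characters U+00C2 U+00A3 display as "Â£", which matches the start of the corrupted text in the question. The same round trip turns è (U+00E8, UTF-8 bytes 0xC3 0xA8) into "Ã¨".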

Change the encoding in the XML declaration to reflect this:

    <?xml version="1.0" encoding="UTF-8"?>

and see if that fixes things. Assuming it does, you then need to work out what's produced this bad data to start with.
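If the document cannot be corrected at its source, one possible workaround is to decode the bytes yourself and hand the parser a character stream. SAX parsers read a character stream directly and disregard the encoding named in the XML declaration. The sketch below assumes the bytes really are UTF-8 and that the declaration wrongly says ISO-8859-1. It uses the JDK's built-in SAX parser rather than the Xerces setup from the question, and the class name, the parseText helper and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class ForcedUtf8Parse {
    // Parse the stream as UTF-8 regardless of the declared encoding and
    // return all character data found in the document.
    static String parseText(InputStream xmlStream) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // Decoding here means the parser never has to guess the encoding.
        Reader chars = new InputStreamReader(xmlStream, StandardCharsets.UTF_8);
        reader.parse(new InputSource(chars));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: the declaration says ISO-8859-1, the bytes are UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><TEXT>\u00A3</TEXT>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        String parsed = parseText(new ByteArrayInputStream(utf8));
        // Compare against the expected character instead of printing it,
        // so the result does not depend on the console encoding.
        System.out.println(parsed.equals("\u00A3"));
    }
}
```

This only hides the symptom: the declaration and the bytes still disagree, so fixing whatever produces the document remains the better long-term option.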
