mjn
mjn

Reputation: 36644

Decoding of Unicode characters in a ISO-8859-1 encoded XML document

Using javax.xml.transform I created this ISO-8859-1 document which contains two &#-encoded characters and :

<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>&#50108; and &#50102;</xml>

Question: how will a standards-compliant XML reader interpret the 쎼 and 쎶,


Code to generate the XML:

public void testInvalidCharacter() {
    try {
        String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
        System.out.println(str);

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(str);
        doc.appendChild(root);

        DOMSource domSource = new DOMSource(doc);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());

        StringWriter out = new StringWriter();
        transformer.transform(domSource, new StreamResult(out));

        System.out.println(out.toString());

    } catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
        e.printStackTrace(System.err);
    }
}

Upvotes: 0

Views: 324

Answers (1)

Thomas Philipp
Thomas Philipp

Reputation: 276

An XML Parser will recognize the '&#...' escape syntax and properly return 쎼 and 쎶 with its API for the text of the element. E.g. in Java the org.w3c.dom.Element.getTextContent() method for the Element with the tag Name 'xml' will return a String with that Unicode characters, though your XML document itself is ISO-8859-1

Upvotes: 1

Related Questions