Reputation: 36644
Using javax.xml.transform I created this ISO-8859-1 document, which contains two &#-encoded characters, &#50108; and &#50102;:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>&#50108; and &#50102;</xml>
Question: how will a standards-compliant XML reader interpret the &#50108; and &#50102; references, and will it return them as 쎼 and 쎶?
Code to generate the XML:
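import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public void testInvalidCharacter() {
    try {
        String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
        System.out.println(str);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(str);
        doc.appendChild(root);
        DOMSource domSource = new DOMSource(doc);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // ISO-8859-1 cannot represent U+C3BC or U+C3B6, so the serializer writes them as character references
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());
        StringWriter out = new StringWriter();
        transformer.transform(domSource, new StreamResult(out));
        System.out.println(out.toString());
    } catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
        e.printStackTrace(System.err);
    }
}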
Upvotes: 0
Views: 324
Reputation: 276
An XML parser will recognize the '&#...;' character reference syntax and return 쎼 and 쎶 through its API as the text of the element. A numeric character reference always names a Unicode code point, regardless of the encoding declared for the document. For example, in Java, calling org.w3c.dom.Element.getTextContent() on the element with the tag name 'xml' will return a String containing those Unicode characters, even though the XML document itself is encoded in ISO-8859-1.
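To check this yourself, here is a minimal sketch that parses the document back and inspects the text of the root element. It assumes the serialized references are &#50108; and &#50102; (the decimal values of U+C3BC and U+C3B6); the class name CharRefDemo and the helper rootText are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharRefDemo {

    // Hypothetical helper: encodes the XML string as ISO-8859-1 bytes,
    // parses them, and returns the text content of the root element.
    public static String rootText(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Assumed serializer output: pure ASCII, with the two characters
        // written as decimal character references.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<xml>&#50108; and &#50102;</xml>";
        String text = rootText(xml);

        // Print code points instead of glyphs so the result does not
        // depend on the console encoding.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));

        // Compare against the string the question started with.
        System.out.println(text.equals("\uC3BC and \uC3B6"));
    }
}
```

The first and last code points printed should be U+C3BC and U+C3B6, which suggests the round trip through ISO-8859-1 loses nothing: the declared encoding only governs how the bytes of the file are decoded, not what the character references mean.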
Upvotes: 1