Reputation: 1
I'm trying to parse and edit a UTF-8 encoded XML file, but certain characters come back as what look like HTML numeric codes instead of the characters themselves.
To troubleshoot, I set up a DOM parser that simply copies the XML with no edits. I'm working with Japanese kanji/Chinese characters, and some of them are written out as numeric codes rather than as the characters. I've tried specifying UTF-8 on the input stream, the transformer, and the output stream, but the result is the same. I took this code excerpt from https://www.journaldev.com/901/modify-xml-file-in-java-dom-parser.
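import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Class name is arbitrary; the excerpt omitted the enclosing class and method.
public class CopyXml {
    public static void main(String[] args) {
        String filePath = "file path";
        File xmlFile = new File(filePath);
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder;
        try {
            dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(xmlFile);
            doc.getDocumentElement().normalize();
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            DOMSource source = new DOMSource(doc);
            StreamResult result = new StreamResult(new File("updated.xml"));
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.transform(source, result);
            System.out.println("XML file updated successfully");
        } catch (SAXException | ParserConfigurationException | IOException | TransformerException e1) {
            e1.printStackTrace();
        }
    }
}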
This is what the XML looks like before parsing; it should look the same after being written back out:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Entry for Kanji: 𠮟 -->
<character>
<literal>𠮟</literal>
</character>
This is what is being returned:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Entry for Kanji: 𠮟 -->
<character>
<literal>&#134047;</literal>
</character>
Upvotes: 0
Views: 322
Reputation: 17353
It seems that the core problem is that Transformer.transform() only cleanly serializes characters in the Basic Multilingual Plane (BMP), though there may be more to the story than that. I cloned the code from your link and created an input XML file based on your example, containing several CJK characters:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<character>
<!-- Basic Multilingual Plane -->
<!-- CJK Unified Ideographs Extension A -->
<literal>U+3400 㐀</literal>
<literal>U+4DB5 䶵</literal>
<!-- CJK Unified Ideographs Extension -->
<literal>U+53F1 叱</literal>
<!-- Supplementary Ideographic Plane -->
<!-- CJK Unified Ideographs Extension B -->
<literal>U+20000 𠀀</literal>
<literal>U+20B9F 𠮟</literal>
<literal>U+2A6D6 𪛖</literal>
<!-- CJK Unified Ideographs Extension C 𫜴 -->
<literal>U+2A700 𪜀</literal>
<literal>U+2B734 𫜴</literal>
<!-- CJK Unified Ideographs Extension D -->
<literal>U+2B740 𫝀</literal>
<literal>U+2B81D 𫠝</literal>
</character>
When I ran the application (using JDK 11), the three CJK characters in the BMP were transformed correctly, but all of those in the Supplementary Ideographic Plane (SIP) were written out as numeric character references (the "HTML codes" you are seeing). Here's the XML file that was created:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<character>
<!-- Basic Multilingual Plane -->
<!-- CJK Unified Ideographs Extension A -->
<literal>U+3400 㐀</literal>
<literal>U+4DB5 䶵</literal>
<!-- CJK Unified Ideographs Extension -->
<literal>U+53F1 叱</literal>
<!-- Supplementary Ideographic Plane -->
<!-- CJK Unified Ideographs Extension B -->
<literal>U+20000 &#131072;</literal>
<literal>U+20B9F &#134047;</literal>
<literal>U+2A6D6 &#173782;</literal>
<!-- CJK Unified Ideographs Extension C 𫜴 -->
<literal>U+2A700 &#173824;</literal>
<literal>U+2B734 &#177972;</literal>
<!-- CJK Unified Ideographs Extension D -->
<literal>U+2B740 &#177984;</literal>
<literal>U+2B81D &#178205;</literal>
</character>
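For anyone who wants to try this without any files, here is a minimal sketch that builds the same character/literal structure in memory and serializes it with the default Transformer. The class name SupplementaryRepro and the helper serialize are made up for illustration. It also prints the TransformerFactory implementation class so you can see which serializer is in play. What is printed depends on the JDK's bundled implementation, so no particular output is promised here.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; not part of the original example.
public class SupplementaryRepro {

    // Builds <character><literal>TEXT</literal></character> in memory
    // and serializes it to a String with the default Transformer.
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("character");
            Element literal = doc.createElement("literal");
            literal.setTextContent(text);
            root.appendChild(literal);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Built from code points so the source file stays pure ASCII.
        String bmp = new String(Character.toChars(0x53F1));   // in the BMP
        String sip = new String(Character.toChars(0x20B9F));  // in the SIP

        // Shows which TransformerFactory implementation the runtime picked.
        System.out.println(TransformerFactory.newInstance().getClass().getName());
        System.out.println(serialize(bmp));
        System.out.println(serialize(sip));
    }
}
```

Comparing the two serialized strings side by side makes it easy to see whether a given runtime escapes the supplementary character.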
When I run the code in the debugger, it seems that the JRE uses Xalan for its implementation of Transformer.transform(). There is a very old SO post, "Serializing supplementary unicode characters into XML documents with Java", which is not a duplicate of your problem but is related. The poster even raised a Xalan bug report for the issue in 2012, "ToXMLStream does not support unicode supplementary characters", which is still open!
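While that bug remains open, one possible workaround is to post-process the serialized text and turn numeric character references for supplementary code points back into the characters themselves. The sketch below is only an illustration under assumptions: the class SupplementaryRefDecoder and the method decodeSupplementary are hypothetical names, and the approach treats the XML as plain text, so a CDATA section that contains something like a literal reference would be altered too. References below U+10000 (for example an escaped "<") are deliberately left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a text-level workaround, not a fix for the serializer.
public class SupplementaryRefDecoder {

    // Matches decimal (&#134047;) and hexadecimal (&#x20B9F;) references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // Replaces references to code points above U+FFFF with the actual
    // character and leaves every other reference untouched.
    static String decodeSupplementary(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String body = m.group(1);
            int codePoint;
            try {
                codePoint = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                codePoint = -1; // too large to be a code point; keep the text as-is
            }
            sb.append(xml, last, m.start());
            if (Character.isSupplementaryCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            } else {
                sb.append(m.group());
            }
            last = m.end();
        }
        sb.append(xml, last, xml.length());
        return sb.toString();
    }
}
```

One way to use it would be to transform into a StringWriter instead of a File, pass the result through decodeSupplementary, and then write the string out as UTF-8. Decoding only supplementary code points is a conservative choice, since none of them are XML markup characters.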
The character 𠮟 (U+20B9F) that you mentioned in your comment is in the SIP, which is presumably why it was transformed to an escape code, whereas the very similar character 叱 (U+53F1) is in the BMP and was transformed correctly.
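The difference between those two characters can be checked directly in Java. This small sketch (CodePointInfo and describe are illustrative names only) reports whether a code point is supplementary, how many UTF-16 units and UTF-8 bytes it needs, and the decimal reference that corresponds to it.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; the names are not from the original code.
public class CodePointInfo {

    // Summarizes one code point: plane membership, encoded sizes,
    // and the decimal numeric character reference for it.
    static String describe(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return String.format(
                "U+%04X supplementary=%b utf16Units=%d utf8Bytes=%d ref=&#%d;",
                codePoint,
                Character.isSupplementaryCodePoint(codePoint),
                s.length(),
                s.getBytes(StandardCharsets.UTF_8).length,
                codePoint);
    }

    public static void main(String[] args) {
        // prints: U+53F1 supplementary=false utf16Units=1 utf8Bytes=3 ref=&#21489;
        System.out.println(describe(0x53F1));
        // prints: U+20B9F supplementary=true utf16Units=2 utf8Bytes=4 ref=&#134047;
        System.out.println(describe(0x20B9F));
    }
}
```

Every code point outside the BMP needs a surrogate pair (two char values) in a Java String and four bytes in UTF-8, which is exactly the group of characters that came out escaped in the test above.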
I don't know why this issue exists, but there are several possible reasons:
- Transformer.transform() only supports characters in the BMP.
- Transformer.transform() does not support the transformation of four-byte Unicode characters.
Upvotes: 1