Tri

Reputation: 1

Certain Unicode characters being returned as their HTML code after parsing

I'm trying to parse and edit an XML file that is encoded in UTF-8, but certain characters are being returned as what look like their HTML numeric codes instead of the characters themselves.

To troubleshoot this problem I've set up a DOM parser that simply makes a copy of the XML with no edits. I'm working specifically with Japanese kanji/Chinese characters, and some of them are being parsed and returned as their HTML codes. I've tried specifying the encoding as UTF-8 on the input stream, the transformer, and the output stream, but the results are the same. I took this particular code excerpt from https://www.journaldev.com/901/modify-xml-file-in-java-dom-parser.

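import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Wrapper class (the name is illustrative) so the excerpt compiles on its own.
public class XmlCopy {
    public static void main(String[] args) {
        String filePath = "file path";
        File xmlFile = new File(filePath);
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            // Parse the input; the encoding is taken from the XML declaration.
            Document doc = dBuilder.parse(xmlFile);
            doc.getDocumentElement().normalize();

            // Identity transform: write the DOM back out without any edits.
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            DOMSource source = new DOMSource(doc);
            StreamResult result = new StreamResult(new File("updated.xml"));
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.transform(source, result);
            System.out.println("XML file updated successfully");
        } catch (SAXException | ParserConfigurationException | IOException | TransformerException e1) {
            e1.printStackTrace();
        }
    }
}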

This is what the XML looks like before parsing, and should look the same after being returned:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Entry for Kanji: 𠮟 -->
<character>
  <literal>𠮟</literal>
</character>

This is what is being returned:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Entry for Kanji: 𠮟 -->
<character>
  <literal>&#134047;</literal>
</character>

Upvotes: 0

Views: 322

Answers (1)

skomisa

Reputation: 17353

It seems that the core problem is that Transformer.transform() will only support the "clean" transformation of characters in the Basic Multilingual Plane (BMP), though there may be more to the story than that. I cloned the code from your link and created an input XML file based on your example containing several CJK characters:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<character>
    <!-- Basic Multilingual Plane -->
    <!-- CJK Unified Ideographs Extension A -->
    <literal>U+3400 㐀</literal>
    <literal>U+4DB5 䶵</literal>
    <!-- CJK Unified Ideographs Extension -->
    <literal>U+53F1 叱</literal>
    <!-- Supplementary Ideographic Plane -->
    <!-- CJK Unified Ideographs Extension B -->
    <literal>U+20000 𠀀</literal>
    <literal>U+20B9F 𠮟</literal>
    <literal>U+2A6D6 𪛖</literal>
    <!-- CJK Unified Ideographs Extension C 𫜴 -->
    <literal>U+2A700 𪜀</literal>
    <literal>U+2B734 𫜴</literal>
    <!-- CJK Unified Ideographs Extension D -->
    <literal>U+2B740 𫝀</literal>
    <literal>U+2B81D 𫠝</literal>
</character>

When I ran the application (using JDK 11), the three CJK characters in the BMP were transformed correctly, but all of those in the Supplementary Ideographic Plane (SIP) were written out as decimal numeric character references, which are the "HTML escape codes" you are seeing. For example, &#134047; is the decimal form of U+20B9F. Here's the XML file that was created:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<character>
    <!-- Basic Multilingual Plane -->
    <!-- CJK Unified Ideographs Extension A -->
    <literal>U+3400 㐀</literal>
    <literal>U+4DB5 䶵</literal>
    <!-- CJK Unified Ideographs Extension -->
    <literal>U+53F1 叱</literal>
    <!-- Supplementary Ideographic Plane -->
    <!-- CJK Unified Ideographs Extension B -->
    <literal>U+20000 &#131072;</literal>
    <literal>U+20B9F &#134047;</literal>
    <literal>U+2A6D6 &#173782;</literal>
    <!-- CJK Unified Ideographs Extension C 𫜴 -->
    <literal>U+2A700 &#173824;</literal>
    <literal>U+2B734 &#177972;</literal>
    <!-- CJK Unified Ideographs Extension D -->
    <literal>U+2B740 &#177984;</literal>
    <literal>U+2B81D &#178205;</literal>
</character>
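
The experiment above can also be sketched in memory, without any files. This is a minimal sketch rather than the exact code that was run: the class name RoundTripDemo and the method roundTrip are made up for illustration, and the result depends on which Transformer implementation and JDK version are in use, so no particular output is claimed here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class: parses XML text into a DOM and serializes it
// again with the default identity Transformer.
public class RoundTripDemo {

    static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bmp = "\u53F1";                                // U+53F1, in the BMP
        String sip = new String(Character.toChars(0x20B9F));  // U+20B9F, in the SIP
        String xml = "<character>"
                + "<literal>" + bmp + "</literal>"
                + "<literal>" + sip + "</literal>"
                + "</character>";
        // Compare how the two <literal> elements come back.
        System.out.println(roundTrip(xml));
    }
}
```

If the second literal comes back as &#134047; while the first is unchanged, the environment shows the same behaviour as the files above.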

When I run the code in the debugger it seems that the JRE uses Xalan for its implementation of Transformer.transform(). There is a very old SO post, "Serializing supplementary unicode characters into XML documents with Java", which is not a duplicate of your problem but is related. The poster even raised a Xalan bug report for the issue in 2012, "ToXMLStream does not support unicode supplementary characters", which is still open!
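
A debugger is not strictly needed to see which implementation is in play. The following is a small sketch (the class and method names are illustrative) that prints the concrete class behind Transformer. On a stock JDK one would expect a class in an internal Xalan-derived package, but that is an assumption and it changes if another XSLT library is on the classpath.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper: reports the runtime class of the default Transformer.
public class WhichTransformer {

    static String transformerClassName() {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            return t.getClass().getName();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The printed name depends on the JDK and on the classpath.
        System.out.println(transformerClassName());
    }
}
```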

The character 𠮟 (U+20B9F) that you mentioned in your comment is in the SIP, which is presumably why it was transformed to an escape code, whereas the very similar character 叱 (U+53F1) is in the BMP and was transformed correctly.
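
The difference between the two characters can be checked with the standard Character API. This is only an illustrative sketch (PlaneCheck and its methods are made-up names): it tests whether a character lies outside the BMP and builds the decimal reference that corresponds to its code point.

```java
// Hypothetical helper for inspecting the first code point of a string.
public class PlaneCheck {

    // True if the first code point is above U+FFFF, i.e. outside the BMP.
    static boolean isSupplementary(String s) {
        return Character.isSupplementaryCodePoint(s.codePointAt(0));
    }

    // Decimal numeric character reference for the first code point.
    static String numericReference(String s) {
        return "&#" + s.codePointAt(0) + ";";
    }

    public static void main(String[] args) {
        String bmp = "\u53F1";                                // U+53F1
        String sip = new String(Character.toChars(0x20B9F));  // U+20B9F

        System.out.println(isSupplementary(bmp));   // prints false
        System.out.println(isSupplementary(sip));   // prints true
        System.out.println(sip.length());           // prints 2 (a surrogate pair)
        System.out.println(numericReference(sip));  // prints &#134047;
    }
}
```

The last line matches the reference that appears in the transformed file, since 0x20B9F is 134047 in decimal.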

I don't know why this issue exists, but there are several possible reasons:

  • Xalan's implementation of Transformer.transform() only supports characters in the BMP.
  • Xalan's implementation of Transformer.transform() does not support characters that need four bytes in UTF-8 (surrogate pairs in Java's UTF-16 strings).
  • Xalan has not been updated to support the CJK characters specified in the more recent CJK Unified Ideographs Extensions.
  • There was a deliberate design decision made to transform SIP characters in that manner. That might seem unlikely, except that:
    • The HTML escape codes are always correct.
    • SIP characters are transformed properly within comments.
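
On the point that the escape codes are always correct: a numeric character reference and the literal character are the same content to an XML parser, so the transformed file should still parse to the same text. A minimal sketch of that check follows (ReferenceEquivalence and textOf are illustrative names, not part of the code from the question).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: compares an escaped and a literal form after parsing.
public class ReferenceEquivalence {

    // Parses the XML text and returns the text content of the root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sip = new String(Character.toChars(0x20B9F));  // U+20B9F
        String escaped = textOf("<literal>&#134047;</literal>");
        String literal = textOf("<literal>" + sip + "</literal>");
        System.out.println(escaped.equals(literal));  // prints true
    }
}
```

So the output differs from the input byte for byte, but any conforming XML consumer reading updated.xml should see the original characters.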

Upvotes: 1
