user5199450

Reputation: 41

java xml parser with emoji character

The following code parses an XML file. I noticed that emoji characters are not parsed correctly. In the example, the input value has one emoji at the end (http://www.iemoji.com/view/emoji/693/people/revolving-hearts), and that character is doubled in the output. Is this a known bug?

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlTest {

    public static void main(String[] args) {            
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setValidating(false);
        File file = new File("c:\\temp\\emoji.xml");

        try {
            DocumentBuilder builder = domFactory.newDocumentBuilder();
            Document doc = builder.parse(file);

            NodeList nodes = doc.getElementsByTagName("entry");
            Node node = nodes.item(0);
            NamedNodeMap map = ((Element)node).getAttributes();

            for (int i=0; i<map.getLength(); i++) {
                Node n = map.item(i);
                System.out.println();
                System.out.println(n.getNodeValue());

                char[] chars = n.getNodeValue().toCharArray();

                for (int j=0; j<chars.length; j++) {
                    System.out.print(chars[j] + ", " + (int)chars[j] + "  ");                   
                }
            }

        } catch (Exception e) {e.printStackTrace(); }
    }
}

Here's the input emoji.xml:

<Attributes>
  <Map>
    <entry key="name" value="💞test💞"/>
  </Map>
</Attributes>

and output:

name
n, 110  a, 97  m, 109  e, 101  
💞test💞💞
?, 55357  ?, 56478  t, 116  e, 101  s, 115  t, 116  ?, 55357  ?, 56478  ?, 55357  ?, 56478

Upvotes: 4

Views: 3804

Answers (2)

Zman777

Reputation: 11

A few updates: this issue has been fixed in the early-access release of Java 9 (build 9-ea+103-2016-01-27-183833.javare.4341.nc). It still exists in the latest build of Java 8 (build 1.8.0_72-b15). For some reason Oracle closed the bug that was opened from my service request against Java 6/7/8 for this issue as not reproducible. I'm trying to get them to reopen it.

Here is the exact same issue, opened against OpenJDK, where it was fixed in OpenJDK 9: https://bugs.openjdk.java.net/browse/JDK-8062362

Upvotes: 1

wero

Reputation: 33000

I can reproduce the problem using JDK 1.7.
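
To check a given JDK without creating a file, the question's document can be parsed from an in-memory string. This is only a sketch: the class name EmojiAttributeCheck and the helper parseValue are made up for illustration, and it assumes the bug shows up the same way when the input comes from a StringReader rather than a file. The emoji is written as the escape \uD83D\uDC9E so the source file encoding does not matter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EmojiAttributeCheck {

    // U+1F49E (revolving hearts) as a UTF-16 surrogate pair.
    static final String EMOJI = "\uD83D\uDC9E";

    // Hypothetical helper: parses the XML text and returns the "value"
    // attribute of the first <entry> element.
    static String parseValue(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element entry = (Element) doc.getElementsByTagName("entry").item(0);
            return entry.getAttribute("value");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse test document", e);
        }
    }

    public static void main(String[] args) {
        String expected = EMOJI + "test" + EMOJI;
        String xml = "<Attributes><Map><entry key=\"name\" value=\""
                + expected + "\"/></Map></Attributes>";
        String actual = parseValue(xml);

        // Compare code point counts instead of printing the raw characters,
        // since many consoles cannot display the emoji.
        System.out.println("expected code points: "
                + expected.codePointCount(0, expected.length()));
        System.out.println("actual code points:   "
                + actual.codePointCount(0, actual.length()));
        System.out.println(actual.equals(expected)
                ? "attribute value parsed correctly"
                : "attribute value was altered by the parser");
    }
}
```

On an affected JDK the second count should come out larger than the first; on a fixed one the two should match.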

The cause of the problem seems to be a bug in the XML parser shipped with the JDK (in this case Xerces, located in the packages com.sun.org.apache.xerces.internal.* in rt.jar).

The emoji characters are not in the Unicode BMP and are therefore represented as two chars (a high and a low surrogate). When the parser encounters these surrogates, it treats them in a special way and checks whether they form a valid XML character when combined into a supplementary character.
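
As a small illustration of the surrogate representation (not part of the parser; the class name SurrogateDemo is made up), the emoji from the question, U+1F49E, can be inspected with the standard String and Character methods:

```java
class SurrogateDemo {

    // U+1F49E (revolving hearts) written as its UTF-16 surrogate pair.
    static final String EMOJI = "\uD83D\uDC9E";

    public static void main(String[] args) {
        // One code point, but two Java chars.
        System.out.println(EMOJI.length());                           // 2
        System.out.println(EMOJI.codePointCount(0, EMOJI.length()));  // 1

        // The two chars are the values printed in the question's output.
        System.out.println((int) EMOJI.charAt(0));                    // 55357
        System.out.println((int) EMOJI.charAt(1));                    // 56478
        System.out.println(Character.isHighSurrogate(EMOJI.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(EMOJI.charAt(1)));  // true

        // Combined, they give a code point outside the BMP.
        int codePoint = EMOJI.codePointAt(0);
        System.out.println(Integer.toHexString(codePoint));           // 1f49e
        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
    }
}
```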

The buggy code is located in XMLScanner.scanAttributeValue, in the following section:

            } else if (c != -1 && XMLChar.isHighSurrogate(c)) {
                if (scanSurrogates(fStringBuffer3)) {
                    stringBuffer.append(fStringBuffer3);
                    if (entityDepth == fEntityDepth && fNeedNonNormalizedValue) {
                        fStringBuffer2.append(fStringBuffer3);
                    }

The two chars of the emoji character are parsed into a buffer variable, fStringBuffer3, which is then appended to the buffer for the attribute value. The problem is that fStringBuffer3 is not cleared afterwards. When the second emoji character is parsed, the buffer still contains the old content, so the earlier chars are appended a second time.

If you try with an attribute value containing three or more emojis, you can clearly see how they accumulate.
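
The accumulation can be imitated outside the parser. The following is a simplified model, not the Xerces code: the class StaleBufferDemo, the method scan and the clearBuffer flag are invented for illustration, and clearing the buffer before each use is just one plausible way to avoid the problem, not necessarily what the JDK fix does.

```java
class StaleBufferDemo {

    // Copies input to the result, routing surrogate pairs through a reused
    // scratch buffer (standing in for fStringBuffer3). If clearBuffer is
    // false, the scratch buffer keeps its old content between pairs.
    static String scan(String input, boolean clearBuffer) {
        StringBuilder value = new StringBuilder();
        StringBuilder scratch = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1));
            if (pair) {
                if (clearBuffer) {
                    scratch.setLength(0);   // assumed remedy: reset before reuse
                }
                scratch.append(c).append(input.charAt(++i));
                value.append(scratch);      // appends everything still in scratch
            } else {
                value.append(c);
            }
        }
        return value.toString();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDC9E";      // U+1F49E as a surrogate pair

        // Same shape as the question: emoji + "test" + emoji.
        String two = emoji + "test" + emoji;
        System.out.println(codePoints(scan(two, false)));   // 7 (last emoji doubled)
        System.out.println(codePoints(scan(two, true)));    // 6

        // Three emojis in a row: 1 + 2 + 3 copies are appended.
        String three = emoji + emoji + emoji;
        System.out.println(codePoints(scan(three, false))); // 6
        System.out.println(codePoints(scan(three, true)));  // 3
    }
}
```

With the stale buffer, the model turns the question's input into emoji, "test", emoji, emoji, which matches the output shown in the question.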

Upvotes: 5
