Reputation: 1882
We have a mobile client that communicates with the server using XML. I have run into a problem when we need to send some of the more recent UTF-8 smileys, which are now easily accessible on new phones. For instance: 😉😯🙃😡😬😠.
Now, my Android application has no issue with encoding and sending this, but on the server side things tend to be a bit more explodey.
If we try to send a message using any of the smileys above we get a huge stack trace, with the relevant part:
javax.xml.transform.TransformerException: org.xml.sax.SAXException: Invalid UTF-16 surrogate detected: d83d d83d ?
java.io.IOException: Invalid UTF-16 surrogate detected: d83d d83d ?
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
And if we try to parse it:
2017-01-13 14:00:22,717 - com.zylinc.core.gatekeeper.stripes.DoBean - WARN - Could not handle request
org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 93; Character reference "&#55357" is an invalid XML character.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.zylinc.core.gatekeeper.stripes.DoBean.parseRequest(DoBean.java:127)
at com.zylinc.core.gatekeeper.stripes.DoBean.execute(DoBean.java:56)
at com.zylinc.core.gatekeeper.Dispatcher.onRequest(Dispatcher.java:107)
at com.zylinc.core.gatekeeper.io.UntrustedSocketListener.handleRequest(UntrustedSocketListener.java:16)
at com.zylinc.core.gatekeeper.io.SocketListener$MessageHandler.run(SocketListener.java:228)
at java.lang.Thread.run(Unknown Source)
In that case the XML is:
<?xml version="1.0" encoding="UTF-8"?><action>
<set>
<absence requestid="0" from="2017 01 13 13 00 11" to="2017 01 13 22 59 11" subject="&#55357;&#56846;" user_id="CN=???????? ????????????,OU=TestUsers,OU=ZyUsers,DC=Zylinc,DC=com"/>
</set>
</action>
Now, this seems to work just fine when outputting JSON, but moving the clients to use JSON is not something we can do overnight. I'm guessing it breaks because the characters used are too new for the Java version, but it would be nice to ensure that newer smileys won't ever break the messaging.
The code for parsing the XML is pretty straightforward:
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
XMLReader xmlReader = parser.getXMLReader();
xmlReader.setContentHandler(handler);
StringReader reader = new StringReader(xml);
xmlReader.parse(new InputSource(reader));
Edit:
Creating the XML is done like this:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
mDoc = builder.newDocument();
mRoot = mDoc.createElement("action");
mDoc.appendChild(mRoot);
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer trans = transFactory.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
trans.setOutputProperty(OutputKeys.INDENT, "yes");
trans.setOutputProperty(OutputKeys.VERSION, "1.1");
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(mDoc);
trans.transform(source, result);
return sw.toString();
Where adding the text is simply:
xml.setAttribute(SUBJECT, obj.getSubject());
Do I have to specify some encoding or other?
Upvotes: 0
Views: 3332
Reputation: 163418
You're encoding these incorrectly.
If you're using XML character reference notation, &#NNNNN;, then NNNNN must be a Unicode codepoint, not a Unicode codepoint split up into a surrogate pair. For example, &#128526;. In your example, you've got &#55357;&#56846;, which isn't legal, because 55357 and 56846 are not codepoints; they are the two halves of a surrogate pair.
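To illustrate the point about character references, here is a minimal sketch of an escaper that walks the string by code point rather than by UTF-16 char. The class and method names (XmlCharRefs, escapeNonAscii) are made up for this example and are not part of your code or of any library. It only covers the numeric reference part; escaping of &, < and quotes is left out, and it assumes the input contains no unpaired surrogates.

```java
final class XmlCharRefs {
    /**
     * Replaces every non-ASCII character with a decimal character reference
     * built from its full code point (hypothetical helper, illustration only).
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Step by code point so that a surrogate pair is treated as one character.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDE0E is the surrogate pair for U+1F60E.
        System.out.println(escapeNonAscii("ok \uD83D\uDE0E")); // prints: ok &#128526;
    }
}
```

The important detail is codePointAt plus charCount: looping with charAt and emitting one reference per char is what would produce the two illegal references 55357 and 56846.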
In the case where you're representing the characters directly, I'm not sure exactly what you're doing, but the error message "Invalid UTF-16 surrogate detected: d83d d83d" makes it very clear that you are doing it wrong.
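The message "d83d d83d" describes a high surrogate followed by another high surrogate instead of a low one. If it helps to find where the string gets damaged, a check along these lines could be run on the subject before it is put into the document. This is only a sketch; SurrogateCheck and firstUnpairedSurrogate are names invented here.

```java
final class SurrogateCheck {
    /**
     * Returns the index of the first unpaired surrogate in s,
     * or -1 if every surrogate in s is part of a valid high/low pair.
     */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low surrogate.
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the valid pair
                } else {
                    return i;
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }
}
```

For a well-formed smiley such as "\uD83D\uDE0E" this returns -1; for the sequence from the error message, "\uD83D\uD83D", it returns 0.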
The title of your question ("UTF-8 like smileys") suggests that you are confused between Unicode and UTF-8. Unicode maps smileys to integer codepoints, e.g. the first one is hex 1f60e or decimal 128526. UTF-8 is one possible way of encoding Unicode as a stream of bytes or octets, and it can encode every Unicode codepoint as a sequence of one to four bytes.
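As a small illustration of the difference between the codepoint and its UTF-8 bytes, the following sketch encodes a single codepoint and prints the bytes as hex. Utf8Demo and utf8Hex are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Demo {
    /** Encodes one code point as UTF-8 and returns the bytes as lowercase hex. */
    static String utf8Hex(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x1F60E)); // prints: f09f988e
    }
}
```

The codepoint 0x1F60E comes out as the four bytes f0 9f 98 8e, while an ASCII letter such as 'A' stays a single byte, 41.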
UTF-16 is another encoding, which represents most Unicode codepoints as a single 16-bit value, but those above U+FFFF as a pair of 16-bit values called a surrogate pair. Surrogate pairs are not used in UTF-8. It is quite wrong to attempt to encode a Unicode codepoint in UTF-16 as a surrogate pair, and then to encode each half of this surrogate pair independently in UTF-8. But I somehow suspect this is what you are doing.
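For reference, this is roughly how the two halves of a surrogate pair map back to one codepoint. The sketch spells out the arithmetic that the standard Character.toCodePoint method performs; SurrogatePairs and combine are names chosen for this example only.

```java
final class SurrogatePairs {
    /** Combines a high and a low surrogate into the code point they represent. */
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        // Each half carries 10 bits; the result is offset by 0x10000.
        // Same value as Character.toCodePoint(high, low).
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = combine('\uD83D', '\uDE0E');
        System.out.println(Integer.toHexString(cp)); // prints: 1f60e
    }
}
```

So 0xD83D and 0xDE0E (decimal 55357 and 56846) only mean something together, as the single codepoint 0x1F60E (decimal 128526). Passing two high surrogates, as in "d83d d83d", is rejected by the check at the top.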
Upvotes: 8