Alex R
Alex R

Reputation: 11891

Emoji character sequence �� breaks old XML process

I have an old Java application that processes XML from a third-party data feed.

The data feed allows user-input, and it is now suddenly containing emojis such as �� (👇). I'm actually surprised it took this long for this problem to appear (emojis have been around for a few years now).

The app blows up in javax.xml.parsers.DocumentBuilder.parse(InputStream):

org.xml.sax.SAXParseException; lineNumber: 105; columnNumber: 3039; Character reference "&#
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)

Is there a quick, localized fix that I can apply without having to redesign and rearchitect the whole application? Also, would prefer to avoid a regex search/replace hack since that can introduce other subtle problems.

Upvotes: 3

Views: 3777

Answers (1)

Michael Kay
Michael Kay

Reputation: 163458

�� is a single character encoded as a surrogate pair (two surrogates). A character reference in XML cannot represent a (high or low) surrogate: these are not legal characters. The character reference should represent the Unicode codepoint of the Emoji as a whole, 👇.

The third party is sending you invalid XML, and you should reject it as you would reject any other faulty goods from a supplier.

Upvotes: 8

Related Questions