Reputation: 13347
I have the following line of text (see the code below as well):
What I'm trying to do is escape that emoticon (phone icon) as two \u chars and then unescape it back to the original phone icon. The first method below works fine, but I essentially want to escape by range so that I can escape any characters like this one. I don't see how that is possible using the first method.
How can I achieve this range-based escape with UnicodeEscaper and get the same output as StringEscapeUtils (i.e. escape to two \uXXXX sequences, then unescape back to the phone icon)?
import org.apache.commons.lang3.text.translate.UnicodeEscaper;
import org.apache.commons.lang3.text.translate.UnicodeUnescaper;
String text = "Unicode surrogate here-> 📱<--here";
// escape the entire string... not what I want, because there could
// be \n, \r or any other escape chars that I want left intact (I just want a range)
String text2 = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
System.out.println(text2); // "Unicode surrogate here-> \uD83D\uDCF1<--here"
// unescape it back to the phone emoticon
text2 = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text2);
System.out.println(text2); // "Unicode surrogate here-> 📱<--here"
// How do I do the same as above, but only for a range of chars (i.e. any Unicode surrogate)?
// That is what I want, rather than escaping the entire string.
text2 = UnicodeEscaper.between(0x10000, 0x10FFFF).translate(text);
System.out.println(text2); // "Unicode surrogate here-> \u1F4F1<--here"
// unescape .... (need the phone emoticon here)
text2 = (new UnicodeUnescaper().translate(text2));
System.out.println(text2);// "Unicode surrogate here-> 1<--here"
Upvotes: 0
Views: 1946
Reputation: 121702
Your string:
"Unicode surrogate here-> \u1F4F1<--here"
does not do what you think it does.
A char is basically a UTF-16 code unit, and therefore 16 bits wide. So what you actually have here is \u1F4F followed by the literal character 1, and that explains your output.
I don't know exactly what you mean by "escape" here, but if it means replacing each surrogate pair with two \uXXXX sequences, then have a look at Character.toChars(). It returns the char sequence necessary to represent one Unicode code point, whether it is in the BMP (one char) or not (two chars).
For code point U+1f4f1, it will return a two-element char array with characters 0xd83d and 0xdcf1 in that order. And this is what you want.
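As a rough illustration of the Character.toChars() approach, here is a minimal sketch that uses only the JDK. The class name SurrogateEscaper and the method escapeRange are hypothetical names made up for this example; they are not part of Commons Lang. It escapes only code points inside the given range and copies everything else (including \n and \r) unchanged.

```java
class SurrogateEscaper {

    // Hypothetical helper: escapes only code points in [low, high];
    // every other character is copied through untouched.
    static String escapeRange(String input, int low, int high) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (cp >= low && cp <= high) {
                // toChars yields one char for BMP code points, two (a surrogate pair) otherwise
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The phone icon is written as its surrogate pair to keep the source ASCII-only
        String text = "Unicode surrogate here-> \uD83D\uDCF1<--here";
        System.out.println(escapeRange(text, 0x10000, 0x10FFFF));
    }
}
```

With the range 0x10000 to 0x10FFFF, the phone icon comes out as the two escapes \uD83D\uDCF1 while the rest of the string is left as it was.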
Upvotes: 2
Reputation: 795
This is a late answer, but I've found that you need the
org.apache.commons.lang3.text.translate.JavaUnicodeEscaper
class instead of UnicodeEscaper.
Using it, it prints:
Unicode surrogate here-> \uD83D\uDCF1<--here
And the unescaping works well.
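To see why the surrogate-pair form round-trips while \u1F4F1 does not, here is a small JDK-only sketch of an unescaper that reads exactly four hex digits after each backslash-u. This is an assumption about the general four-digit rule, not the actual UnicodeUnescaper source; UnicodeUnescapeSketch, unescape and isHex are hypothetical names used only here.

```java
class UnicodeUnescapeSketch {

    // Hypothetical helper: turns each backslash-u plus exactly four hex digits
    // into the corresponding UTF-16 code unit; all other text is copied as is.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean isEscape = input.startsWith("\\u", i)
                    && i + 6 <= input.length()
                    && isHex(input, i + 2, i + 6);
            if (isEscape) {
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit
    static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two four-digit escapes recombine into the single code point U+1F4F1
        System.out.println(unescape("\\uD83D\\uDCF1").codePointAt(0) == 0x1F4F1);
        // A five-digit escape is read as four digits plus a trailing literal 1
        System.out.println(unescape("\\u1F4F1").length() == 2);
    }
}
```

Both checks in main print true: the pair of escapes becomes one supplementary code point, whereas the five-digit form becomes the char 0x1F4F followed by a plain 1, which matches the output seen in the question.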
Upvotes: 3