Reputation: 31
I have a string like this:
"\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"" ...
and I want to replace every emoji symbol with its Unicode value, like so:
"\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"" ...
After searching a lot on the web, I found a way to "translate" one symbol to its Unicode value with this code:
String s = "👺";
int emoji = Character.codePointAt(s, 0);
String unumber = "U+" + Integer.toHexString(emoji).toUpperCase();
But how can I change my code to convert every emoji in a string?
P.S. The output can be in either \Uxxxxx or U+xxxxx format.
Upvotes: 3
Views: 6643
Reputation: 17363
In your code you don't need to specify any code point ranges, nor do you need to worry about surrogates. Instead, just specify the Unicode blocks whose characters you want presented as Unicode escapes, using the constants declared in the Character.UnicodeBlock class. For example, to determine whether 😁 (0x1F601) is an emoticon:
boolean emoticon = Character.UnicodeBlock.EMOTICONS.equals(Character.UnicodeBlock.of("😁".codePointAt(0)));
System.out.println("Is 😁 an emoticon? " + emoticon); // Prints true.
Here's general-purpose code. It will process any String, presenting individual characters as their Unicode equivalents if they fall within the specified Unicode blocks:
package symbolstounicode;

import java.util.List;
import java.util.stream.Collectors;

public class SymbolsToUnicode {

    public static void main(String[] args) {
        Character.UnicodeBlock[] blocksToConvert = new Character.UnicodeBlock[]{
            Character.UnicodeBlock.EMOTICONS,
            Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS};
        String input = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
        String output = SymbolsToUnicode.toUnicode(input, blocksToConvert);
        System.out.println("String to convert: " + input);
        System.out.println("Converted string: " + output);
        // Only checked when assertions are enabled (java -ea).
        assert ("\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"".equals(output));
    }

    // Converts characters in the supplied string found in the specified list of UnicodeBlocks to their Unicode equivalents.
    static String toUnicode(String s, final Character.UnicodeBlock[] blocks) {
        StringBuilder sb = new StringBuilder();
        List<Integer> cpList = s.codePoints().boxed().collect(Collectors.toList());
        // Character.toString(int) requires Java 11 or later.
        cpList.forEach(cp -> sb.append(SymbolsToUnicode.inCodeBlock(cp, blocks)
                ? "U+" + Integer.toHexString(cp).toUpperCase()
                : Character.toString(cp)));
        return sb.toString();
    }

    // Returns true if the supplied code point is within one of the specified UnicodeBlocks.
    static boolean inCodeBlock(final int cp, final Character.UnicodeBlock[] blocksToConvert) {
        for (Character.UnicodeBlock b : blocksToConvert) {
            if (b.equals(Character.UnicodeBlock.of(cp))) {
                return true;
            }
        }
        return false;
    }
}
And here's the output, using the test data in the OP:
run:
String to convert: "title":"👺TEST title value 😁","text":"💖 TEST text value."
Converted string: "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."
BUILD SUCCESSFUL (total time: 0 seconds)
Notes:
- toUnicode() takes the String to be converted, and the Unicode blocks for which characters should be converted to Unicode.
- It turns the String into a set of code points using String.codePoints(), and stores them in a List.
Upvotes: 0
Reputation: 4654
Emoji are scattered among different Unicode blocks. For example, 👺 (0x1F47A) and 💖 (0x1F496) are from Miscellaneous Symbols and Pictographs, while 😁 (0x1F601) is from Emoticons.
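To check which block a given code point belongs to, Character.UnicodeBlock.of(int) can be queried directly. The following is a minimal sketch; the class name BlockLookup and the helper describe are made up for illustration.

```java
class BlockLookup {
    // Hypothetical helper: formats a code point together with the name of
    // the Unicode block that the JDK reports for it.
    static String describe(int cp) {
        return "U+" + Integer.toHexString(cp).toUpperCase()
                + " -> " + Character.UnicodeBlock.of(cp);
    }

    public static void main(String[] args) {
        // The three code points from the question.
        int[] codePoints = {0x1F47A, 0x1F601, 0x1F496};
        for (int cp : codePoints) {
            System.out.println(describe(cp));
        }
    }
}
```

Running it prints one line per code point, which makes it easy to see that the three symbols do not all share one block.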
If you want to filter out symbols, you need to decide which Unicode blocks (or code point ranges) you want to match. For example:
String s = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
StringBuilder sb = new StringBuilder();
for (int i = 0, l = s.length(); i < l; i++) {
    char ch = s.charAt(i);
    // Only treat ch as half of a pair if a low surrogate actually follows it;
    // this avoids reading past the end of the string on a trailing high surrogate.
    if (Character.isHighSurrogate(ch) && i + 1 < l && Character.isLowSurrogate(s.charAt(i + 1))) {
        i++;
        char ch2 = s.charAt(i); // Load low surrogate
        int codePoint = Character.toCodePoint(ch, ch2);
        if ((codePoint >= 0x1F300) && (codePoint <= 0x1F64F)) { // Miscellaneous Symbols and Pictographs + Emoticons
            sb.append("U+").append(Integer.toHexString(codePoint).toUpperCase());
        } else { // otherwise just add characters as is
            sb.append(ch);
            sb.append(ch2);
        }
    } else { // not part of a surrogate pair, just add the character
        sb.append(ch);
    }
}
String result = sb.toString();
System.out.println(result); // "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."
To match only emoji, you can narrow the condition using, for example, this list.
But if you want to escape every character that is encoded as a surrogate pair, you can drop the codePoint range check from the code.
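One way to narrow or widen the condition is to keep the ranges in a small table and iterate by code point rather than by char. The sketch below is an assumption-laden illustration: the class name EmojiRangeEscaper, its methods, and the choice of ranges are mine, and the table is deliberately not a complete emoji list. It covers a handful of Unicode blocks that contain many emoji, including two in the Basic Multilingual Plane, which a surrogate-only check would never look at.

```java
class EmojiRangeEscaper {
    // Illustrative, NOT exhaustive: inclusive {start, end} pairs for a few
    // Unicode blocks that contain many emoji. Extend or trim as needed.
    private static final int[][] RANGES = {
        {0x2600, 0x26FF},   // Miscellaneous Symbols
        {0x2700, 0x27BF},   // Dingbats
        {0x1F300, 0x1F5FF}, // Miscellaneous Symbols and Pictographs
        {0x1F600, 0x1F64F}, // Emoticons
        {0x1F680, 0x1F6FF}, // Transport and Map Symbols
        {0x1F900, 0x1F9FF}, // Supplemental Symbols and Pictographs
    };

    // Returns true if cp falls inside any range in the table.
    static boolean inRanges(int cp) {
        for (int[] r : RANGES) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Replaces code points inside the table with U+XXXX and copies the rest.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (inRanges(cp)) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2764 (heavy black heart) and U+1F680 (rocket)
        System.out.println(escape("\u2764 ok \uD83D\uDE80")); // U+2764 ok U+1F680
    }
}
```

Supplementary characters outside the table, such as U+1D11E (musical symbol G clef), pass through unchanged.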
Upvotes: 1
Reputation: 81
Try this solution:
String s = "your string with emoji";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
    int cp = s.codePointAt(i);
    if (Character.isSupplementaryCodePoint(cp)) {
        // A valid surrogate pair: emit one escape and skip the low surrogate
        i++;
        sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
    } else {
        // BMP characters (and any unpaired surrogate) are copied unchanged
        sb.append(s.charAt(i));
    }
}
// result
System.out.println(sb.toString());
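Since the question accepts either the U+xxxxx or the \Uxxxxx form, the same loop can be wrapped in a method that takes the prefix as a parameter. This is only a sketch: the class SupplementaryEscaper and the method escapeSupplementary are hypothetical names, and like the loop above it escapes every code point above U+FFFF, whether or not it is an emoji.

```java
class SupplementaryEscaper {
    // Hypothetical helper: replaces every supplementary code point (anything
    // above U+FFFF) with prefix + uppercase hex, and copies everything else.
    static String escapeSupplementary(String s, String prefix) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(prefix).append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp); // BMP chars and unpaired surrogates
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F47A followed by text, then U+1F601
        String input = "\uD83D\uDC7ATEST \uD83D\uDE01";
        System.out.println(escapeSupplementary(input, "U+"));  // U+1F47ATEST U+1F601
        System.out.println(escapeSupplementary(input, "\\U")); // \U1F47ATEST \U1F601
    }
}
```

Advancing by Character.charCount(cp) instead of a manual i++ keeps the index correct even when the input contains an unpaired surrogate.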
Upvotes: 4