BladeITA

Reputation: 31

How can I replace every emoji in a string with its Unicode code point in Java?

I have a string like this:

"\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"" ...

and I want to replace every emoji symbol with its Unicode value, like so:

"\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"" ...

After searching a lot on the web, I found a way to "translate" one symbol to its Unicode value with this code:

String s = "👺";
int emoji = Character.codePointAt(s, 0); 
String unumber = "U+" + Integer.toHexString(emoji).toUpperCase();

But how can I change my code to handle every emoji in a string?

P.S. The output can be in either \Uxxxxx or U+xxxxx format.

Upvotes: 3

Views: 6643

Answers (3)

skomisa

Reputation: 17363

In your code you don't need to specify any code point ranges, nor do you need to worry about surrogates. Instead, just specify the Unicode blocks for which you want characters to be presented as Unicode escapes. This is achieved by using the field declarations in the Character.UnicodeBlock class. For example, to determine whether 😁 (0x1F601) is an emoticon:

boolean emoticon = Character.UnicodeBlock.EMOTICONS.equals(Character.UnicodeBlock.of("😁".codePointAt(0)));
System.out.println("Is 😁 an emoticon? " + emoticon); // Prints true.

Here's general purpose code. It will process any String, presenting individual characters as their Unicode equivalents if they are defined within the specified Unicode code blocks:

package symbolstounicode;

import java.util.List;
import java.util.stream.Collectors;

public class SymbolsToUnicode {

    public static void main(String[] args) {

        Character.UnicodeBlock[] blocksToConvert = new Character.UnicodeBlock[]{
            Character.UnicodeBlock.EMOTICONS, 
            Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS};
        String input = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
        String output = SymbolsToUnicode.toUnicode(input, blocksToConvert);

        System.out.println("String to convert: " + input);
        System.out.println("Converted string: " + output);
        assert ("\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"".equals(output));
    }

    // Converts characters in the supplied string found in the specified list of UnicodeBlocks to their Unicode equivalents.
    static String toUnicode(String s, final Character.UnicodeBlock[] blocks) {

        StringBuilder sb = new StringBuilder("");
        List<Integer> cpList = s.codePoints().boxed().collect(Collectors.toList());

        // Note: Character.toString(int) requires Java 11 or later.
        cpList.forEach(cp -> sb.append(SymbolsToUnicode.inCodeBlock(cp, blocks) ? 
                "U+" + Integer.toHexString(cp).toUpperCase() : Character.toString(cp)));
        return sb.toString();
    }

    // Returns true if the supplied code point is within one of the specified UnicodeBlocks.
    static boolean inCodeBlock(final int cp, final Character.UnicodeBlock[] blocksToConvert) {

        for (Character.UnicodeBlock b : blocksToConvert) {
            if (b.equals(Character.UnicodeBlock.of(cp))) {
                return true;
            }
        }
        return false;
    }
}

And here's the output, using the test data in the OP:

run:
String to convert: "title":"👺TEST title value 😁","text":"💖 TEST text value."
Converted string: "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

  • I used the Segoe UI Symbol font for the code and the output window to render the symbols properly.
  • The basic idea in the code is:
    • First, specify the String to be converted, and the Unicode code blocks for which characters should be converted to Unicode.
    • Next, convert the String into a sequence of code points using String.codePoints(), and store them in a List.
    • Finally, for each code point, determine whether it exists within any of the specified Unicode blocks, and convert it if necessary.
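The steps above can also be sketched without the intermediate List, by consuming the code point stream directly. This is only an illustration, not part of the code above: the class name CodePointEscaper and the method escape are made up for the example. It appends unchanged characters with StringBuilder.appendCodePoint, so it does not rely on Character.toString(int), which was added in Java 11. The emoji are written as \u escapes so the sample does not depend on the source file encoding.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CodePointEscaper { // hypothetical name, for illustration only

    // Replaces every code point that falls in one of the given blocks with "U+XXXX";
    // all other code points are copied through unchanged.
    static String escape(String s, Set<Character.UnicodeBlock> blocks) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (blocks.contains(Character.UnicodeBlock.of(cp))) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp); // also writes surrogate pairs correctly
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<Character.UnicodeBlock> blocks = new HashSet<>(Arrays.asList(
                Character.UnicodeBlock.EMOTICONS,
                Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS));
        // U+1F47A and U+1F601, written as surrogate pairs
        String input = "\uD83D\uDC7ATEST title value \uD83D\uDE01";
        System.out.println(escape(input, blocks)); // U+1F47ATEST title value U+1F601
    }
}
```

Using a Set instead of an array makes the block lookup a single contains call; for two or three blocks the difference is only cosmetic.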

Upvotes: 0

AterLux

Reputation: 4654

Emoji are scattered among different Unicode blocks. For example, 👺 (0x1F47A) and 💖 (0x1F496) are from Miscellaneous Symbols and Pictographs, while 😁 (0x1F601) is from Emoticons.
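A quick way to check which block a code point belongs to is Character.UnicodeBlock.of. The snippet below is just a sketch for the three code points mentioned here; the class name BlockLookup and the helper describe are made up for the example.

```java
class BlockLookup { // hypothetical name, for illustration only

    // Formats a code point together with the Unicode block Java reports for it.
    static String describe(int cp) {
        return "U+" + Integer.toHexString(cp).toUpperCase()
                + " -> " + Character.UnicodeBlock.of(cp);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x1F47A, 0x1F496, 0x1F601};
        for (int cp : codePoints) {
            System.out.println(describe(cp));
        }
    }
}
```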

If you want to filter out symbols, you need to decide which Unicode blocks (or code point ranges) you want to cover. For example:

    String s = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
    StringBuilder sb = new StringBuilder();
    for (int i = 0, l = s.length() ; i < l ; i++) {
      char ch = s.charAt(i);
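      // Only treat ch as the start of a pair when a low surrogate really follows;
      // this avoids reading past the end of the string on an unpaired high surrogate.
      if (Character.isHighSurrogate(ch) && i + 1 < l && Character.isLowSurrogate(s.charAt(i + 1))) {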
        i++;
        char ch2 = s.charAt(i); // Load low surrogate
        int codePoint = Character.toCodePoint(ch, ch2);
        if ((codePoint >= 0x1F300) && (codePoint <= 0x1F64F)) { // Miscellaneous Symbols and Pictographs + Emoticons
          sb.append("U+").append(Integer.toHexString(codePoint).toUpperCase());
        } else { // otherwise just add characters as is
          sb.append(ch);
          sb.append(ch2);
        }
      } else { // if not a surrogate, just add the character
        sb.append(ch);
      }
    }
    String result = sb.toString();
    System.out.println(result); // "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."

To match only emoji, you can narrow the condition using, for example, this list.

But if you want to escape every supplementary (surrogate-pair) character, you can drop the codePoint check from the code.
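Both options can be sketched with one loop by passing the condition in as a parameter. This is a hedged illustration rather than a drop-in replacement: the class PredicateEscaper and the method escape are names invented for the example, and the IntPredicate stands in for whatever range or list you settle on.

```java
import java.util.function.IntPredicate;

class PredicateEscaper { // hypothetical name, for illustration only

    // Escapes each code point accepted by shouldEscape as "U+XXXX"; copies the rest unchanged.
    static String escape(String s, IntPredicate shouldEscape) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (shouldEscape.test(cp)) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC96 TEST text value."; // U+1F496 followed by plain text

        // Same range check as in the loop above.
        IntPredicate inRange = cp -> cp >= 0x1F300 && cp <= 0x1F64F;
        // No range check at all: anything outside the Basic Multilingual Plane.
        IntPredicate anySupplementary = Character::isSupplementaryCodePoint;

        System.out.println(escape(s, inRange));          // U+1F496 TEST text value.
        System.out.println(escape(s, anySupplementary)); // U+1F496 TEST text value.
    }
}
```

Note that the supplementary-only predicate misses emoji that live in the BMP (for example U+2764), so a narrower list-based predicate may still be needed depending on your input.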

Upvotes: 1

Bohdan Gaponec

Reputation: 81

Try this solution:

String s = "your string with emoji";

StringBuilder sb = new StringBuilder();

for (int i = 0; i < s.length(); i++) {
  if (Character.isSurrogate(s.charAt(i))) {
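    int res = Character.codePointAt(s, i); // full code point for a valid surrogate pair
    i += Character.charCount(res) - 1;     // skip the low surrogate only if one was consumed
    sb.append("U+").append(Integer.toHexString(res).toUpperCase());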
  } else {
    sb.append(s.charAt(i));
  }
}

//result
System.out.println(sb.toString());

Upvotes: 4
