nullsector76

Reputation: 79

Java: get the Unicode escape string from a Unicode code point

I want to get the \u escape string that Java uses from an integer code point. I have looked all over and have yet to find a working answer that produces \ud83e\udd82 for 🦂. I got that escape sequence by compiling a jar and decompiling it in Bytecode Viewer, but I don't know how it generates these strings. It is very useful when developing in Java to copy a Unicode character, paste it in, and get its Java escape form, so that not every source file that uses it has to be saved as UTF-8.

Upvotes: 0

Views: 1665

Answers (2)

nullsector76

Reputation: 79

This converts the Unicode characters of a string directly to Java's escape format.

    /**
     * Returns the Java Unicode escape string for the given string.
     * ASCII characters are copied through unchanged.
     * TODO: add an option to change the unicode number strings to not just the codepoints
     */
    public static String toUnicodeEsq(String unicode)
    {
        StringBuilder b = new StringBuilder();
        int[] arr = unicode.codePoints().toArray();
        for(int i : arr)
            b.append(toUnicodeEsq(i));
        return b.toString();
    }
    
    /**
     * Returns the escape for one code point: ASCII as-is, a BMP code point as one
     * escape, and a supplementary code point as a surrogate pair of two escapes.
     */
    public static String toUnicodeEsq(int cp)
    {
        if(isAscii(cp))
            return "" + (char) cp;
        if(Character.isBmpCodePoint(cp))
            return "\\u" + String.format("%04x", cp);
        return "\\u" + String.format("%04x", (int) Character.highSurrogate(cp))
             + "\\u" + String.format("%04x", (int) Character.lowSurrogate(cp));
    }

    /** True for code points 0 to 127 (Byte.MAX_VALUE). */
    public static boolean isAscii(int cp) 
    {
        return cp <= Byte.MAX_VALUE;
    }
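Because a Java \u escape always names one UTF-16 code unit, the same output can also be produced without any explicit surrogate handling by walking the string's chars instead of its code points. The following is only a minimal sketch of that idea; the class name EscapeByCodeUnit and the method name escape are made up for illustration and are not part of the code above.

```java
public class EscapeByCodeUnit {
    // Hypothetical helper: copies ASCII through and writes every other
    // UTF-16 code unit as a backslash-u escape with four hex digits.
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            if (unit <= 0x7f) {
                out.append(unit);
            } else {
                out.append(String.format("\\u%04x", (int) unit));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', then SCORPION (U+1F982), then e with acute accent (U+00E9)
        String input = "a" + new String(Character.toChars(0x1f982)) + "\u00e9";
        System.out.println(escape(input));
        // prints: a\ud83e\udd82\u00e9
    }
}
```

A supplementary character is already stored as two chars (a high and a low surrogate), so escaping each char separately yields the two-escape form automatically.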

My method doesn't support Unicode numbers (U+hex) directly, but you can convert them one at a time from the CSS, HTML, and Unicode number formats:

    /**
     * Gets the code point from a Unicode number. From there you can convert it to a Unicode escape sequence using {@link JavaUtil#toUnicodeEsq(int)}.
     * "U+hex" for a Unicode number
     * "&#codePoint;" or "&#xhex;" for HTML
     * "\hex" for CSS
     * "hex" for laziness
     */
    public static int parseUnicodeNumber(String num)
    {
        num = num.toLowerCase();
        if(num.startsWith("u+"))
            num = num.substring(2);
        else if(num.startsWith("&#"))
            return num.startsWith("&#x") ? Integer.parseInt(num.substring(3, num.length() - 1), 16) : Integer.parseInt(num.substring(2, num.length() - 1)); 
        else if(num.startsWith("\\"))
            num = num.substring(1);
        return Integer.parseInt(num, 16);
    }
    
    /**
     * convert a unicode number directly to unicode escape sequence in java
     */
    public static String unicodeNumberToEsq(String num)
    {
        return toUnicodeEsq(parseUnicodeNumber(num));
    }

Upvotes: 1

Andreas

Reputation: 159086

🦂 (SCORPION) is Unicode Code Point 1f982, which is UTF-16 d83e dd82, and UTF-8 f0 9f a6 82.
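The three encodings quoted above can be checked with a little arithmetic. As a rough sketch (the class and method names here are invented for illustration), the UTF-16 pair comes from subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves, and the UTF-8 bytes can be read back from the JDK's own encoder:

```java
import java.nio.charset.StandardCharsets;

public class ScorpionEncodings {
    // UTF-16 surrogate pair for a supplementary code point (U+10000 to U+10FFFF).
    public static int[] surrogates(int codePoint) {
        int offset = codePoint - 0x10000;     // 20-bit value
        int high = 0xD800 + (offset >> 10);   // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
        return new int[] { high, low };
    }

    // UTF-8 bytes of a code point as space-separated lowercase hex.
    public static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] pair = surrogates(0x1f982);
        System.out.printf("%04x %04x%n", pair[0], pair[1]); // prints: d83e dd82
        System.out.println(utf8Hex(0x1f982));               // prints: f0 9f a6 82
    }
}
```

In practice Character.highSurrogate and Character.lowSurrogate do the same split; the manual version is only there to show where the numbers come from.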

To convert the code point integer to a Unicode-escaped Java string, run this code:

    // Java 11+
    int codePoint = 0x1f982;
    char[] charArray = Character.toString(codePoint).toCharArray();
    System.out.printf("\\u%04x\\u%04x", (int) charArray[0], (int) charArray[1]);
    // prints: \ud83e\udd82

    // Java 1.5+
    int codePoint = 0x1f982;
    char[] charArray = new String(new int[] { codePoint }, 0, 1).toCharArray();
    System.out.printf("\\u%04x\\u%04x", (int) charArray[0], (int) charArray[1]);
    // prints: \ud83e\udd82
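Both snippets index charArray[1], so they assume a supplementary code point such as 0x1f982; a BMP code point yields a one-element array and would throw ArrayIndexOutOfBoundsException. A possible generalization, sketched here with a made-up class name CodePointEscape, loops over however many chars Character.toChars returns:

```java
public class CodePointEscape {
    // Hypothetical helper: one escape per UTF-16 code unit of the code point.
    // Character.toChars throws IllegalArgumentException for invalid code points.
    public static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x1f982)); // prints: \ud83e\udd82
        System.out.println(escape(0x00e9));  // prints: \u00e9
    }
}
```

Character.toChars has been available since Java 1.5, so this variant does not need the Java 11 overload of Character.toString.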

Upvotes: 5
