Reputation: 266

Java - Converting from unicode to a string?

I can easily create a unicode character and print it with the following lines of code

String uniChar = Character.toString((char)0000);
System.out.println(uniChar);

However, now I want to retrieve the number above, add 3, and print out the new unicode character that the number 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of uniChar? As in "\u0000"? That way I could substring just the "0000", convert it to an int, add 3, and reverse the entire process.

Upvotes: 1

Answers (3)

T.J. Crowder

Reputation: 1075427

I think you're looking for String#codePointAt:

Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length()- 1.

If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.

For instance:

// String containing smiling face with smiling eyes emoji
String str = "😊";
// Get the code point
int cp = str.codePointAt(0);
// Show it
System.out.println(str + ", code point = U+" + toHex(cp));
// Increase it
++cp;
// Get the updated string (from an array of code points)
String updated = new String(new int[] { cp }, 0, 1);
// Show it
System.out.println(updated + ", code point = U+" + toHex(cp));

(toHex is just return Integer.toString(n, 16).toUpperCase();)

That outputs:

😊, code point = U+1F60A

😋, code point = U+1F60B

Upvotes: 5

user07

Reputation: 435

This code works in both cases: for code points in the Unicode BMP and for code points in the supplementary planes, which take 4 bytes to encode in UTF-8. A supplementary code point needs 2 Java char values to be stored, so in that case string.length() is 2.

// array will contain one or two characters
char[] chars = Character.toChars(codePoint);

// string.length will be 1 or 2
String str = new String(chars);
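
As a rough sketch of the point above, the following compares a BMP code point with a supplementary one. The class name CodePointToString and the helper fromCodePoint are made up for illustration; only Character.toChars, String.length and String.codePointCount are standard API.

```java
public class CodePointToString {
    // Hypothetical helper: builds a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        // toChars returns one char for BMP code points, two (a surrogate pair) otherwise
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x263A);            // code point inside the BMP
        String supplementary = fromCodePoint(0x1F60A); // code point outside the BMP

        System.out.println(bmp.length());              // 1
        System.out.println(supplementary.length());    // 2
        // Counting code points instead of chars gives 1 for both strings
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```

So length() counts char values, not code points; use codePointCount when the number of code points is what matters.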

Upvotes: 2

Joop Eggen

Reputation: 109613

Unicode is a numbering of "characters" - code points - up to a 3-byte int range.

The UTF-16 encoding uses a sequence of byte pairs, and a Java char is such a byte pair. The (int) cast of a char is imperfect and covers only a part of Unicode. The correct way to convert a code point to possibly more than one char:

int codePoint = 0x263B;
char[] chars = Character.toChars(codePoint);

To work with Unicode code points, one can do:

int[] codePoints = {0x2639, 0x263a, 0x263b};
String s = new String(codePoints, 0, codePoints.length);
codePoints[0] += 2;

You could use an int array of 1 code point.

In Java 8 one can get an IntStream of code points:
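
Applied to the question's "add 3" case, a minimal sketch with a one-element code point array might look like this. The class name ShiftCodePoint and the helper shiftFirst are hypothetical; the sketch assumes the shifted value is still a valid code point, otherwise the String constructor throws IllegalArgumentException.

```java
public class ShiftCodePoint {
    // Hypothetical helper: returns a one-code-point String whose code point
    // is `delta` higher than the first code point of `s`.
    static String shiftFirst(String s, int delta) {
        int cp = s.codePointAt(0) + delta;
        // Build the result from an int array holding a single code point
        return new String(new int[] { cp }, 0, 1);
    }

    public static void main(String[] args) {
        String shifted = shiftFirst("\u0000", 3);
        // Print the numeric value rather than the (invisible) control character
        System.out.printf("U+%04X%n", shifted.codePointAt(0)); // U+0003
    }
}
```

No substring or parsing of "\u0000" is needed: the number is read directly from the string with codePointAt.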

s.codePoints().forEach(cp -> {
    System.out.printf("U+%X = %s%n", cp, Character.getName(cp));
});

Upvotes: 1
