Pokechu22

Reputation: 5046

What is the proper way to get a char's code point?

I need to do some work with code points and line endings. I have a function that takes a char's code point, and if it is \r it needs to behave differently. I've got this:

if (codePoint == Character.codePointAt(new char[] {'\r'}, 0)) {

but that is very ugly and certainly not the right way to do it. What is the correct method of doing this?

(I know that I could hardcode the number 13 (the decimal value of \r) and use that, but doing that would make it unclear what I am doing...)

Upvotes: 4

Views: 1149

Answers (3)

David Conrad

Reputation: 16359

I know this question is old, but neither of the existing answers actually answers the question, including the accepted answer.

You can simply compare a code point with a char directly.

if (codePoint == '\r')
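As a minimal sketch of that comparison in context (the class and method names here are made up for illustration, not from the question): the char literal is widened to int for the comparison, so it can be checked against any int code point, for example the ones produced by String.codePoints().

```java
public class CarriageReturnCheck {
    // Hypothetical helper: '\r' is widened from char to int before the comparison.
    public static boolean isCarriageReturn(int codePoint) {
        return codePoint == '\r';
    }

    // Counts carriage returns while iterating by code point rather than by char.
    public static long countCarriageReturns(String text) {
        return text.codePoints()
                   .filter(CarriageReturnCheck::isCarriageReturn)
                   .count();
    }

    public static void main(String[] args) {
        System.out.println(isCarriageReturn(13));               // true
        System.out.println(countCarriageReturns("a\r\nb\r\n")); // 2
    }
}
```

This works because \r is in the BMP, so its char value and its code point are the same number.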

Upvotes: 0

Jon Skeet

Reputation: 1500385

If you know that all your input is going to be in the Basic Multilingual Plane (U+0000 to U+FFFF) then you can just use:

char character = 'x';
int codePoint = character;

That uses the implicit conversion from char to int, as specified in JLS 5.1.2:

19 specific conversions on primitive types are called the widening primitive conversions:

  • ...
  • char to int, long, float, or double

...

A widening conversion of a char to an integral type T zero-extends the representation of the char value to fill the wider format.

However, a char is only a UTF-16 code unit. The point of Character.codePointAt is that it copes with code points outside the BMP, which are composed of a surrogate pair - two UTF-16 code units which join together to make a single character.

From JLS 3.1:

The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range, (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

If you need to be able to cope with that more complicated situation, you'll need the more complicated code.
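To illustrate the difference, here is a small sketch using U+1F600 as an arbitrary supplementary character (my choice of example; the class and method names are hypothetical). Reading the first char alone yields only the high surrogate, while String.codePointAt combines the pair into the full code point.

```java
public class SupplementaryDemo {
    // U+1F600 lies outside the BMP, so it occupies two chars (a surrogate pair).
    public static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    // Implicit char-to-int widening: sees only the first UTF-16 code unit.
    public static int firstCharAsInt(String s) {
        return s.charAt(0);
    }

    // Combines a surrogate pair starting at index 0 into one code point.
    public static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(GRINNING_FACE.length());                // 2
        System.out.printf("%X%n", firstCharAsInt(GRINNING_FACE));  // D83D
        System.out.printf("%X%n", firstCodePoint(GRINNING_FACE));  // 1F600
    }
}
```

For a BMP character such as 'x' or '\r', both methods return the same value, which is why the simple conversion is enough in that case.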

Upvotes: 6

Elliott Frisch

Reputation: 201439

If I understand your question, you could simply cast the char to an int, something like this:

char ch = '\r';
int codePoint = (int) ch;
System.out.println(codePoint);

Output is

13

Upvotes: 4
