Faustus

Reputation: 267

Forcing Unicode in byte variable

I recently discovered that you can convert a String to a byte array in the following manner:

String S = "ab";
byte[] arr = S.getBytes();

Now, I tried with the String "\u9999" and the answer was [63]. I thought it would be 9999 (mod 128) = 15, which is what we get from byte b = (byte) 9999. What is the reason for the 63?

Upvotes: 4

Views: 3515

Answers (2)

cshu

Reputation: 5954

It's about the default charset, which may be related to the encoding of your Java source file.

(On my machine, when I compile the Java file with cp1252 encoding, getBytes() also seems to use cp1252 as the default charset. Since cp1252 cannot represent this Unicode character, it becomes a ? character, i.e. 63. When I compile with UTF-16 encoding, getBytes() returns the data 0x9999 as expected.)

The behavior of this method when this string cannot be encoded in the default charset is unspecified. (Source: getBytes() from oracle.com)

My suggestion is to call "\u9999".getBytes(StandardCharsets.UTF_16LE) (or UTF_16BE) to get the 2-byte array you want; then the encoding of the Java source no longer matters. The array should be {-103, -103}.

A byte with the value -103 is stored in memory as 0x99.
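A minimal sketch of the explicit-charset call, to show where -103 comes from. The class name Utf16Demo and the helper utf16le are made up for illustration; only String.getBytes(Charset) and StandardCharsets.UTF_16LE are standard library names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Demo {
    // Hypothetical helper: encode as UTF-16 little-endian, no byte-order mark.
    public static byte[] utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = utf16le("\u9999");
        // Java bytes are signed, so the bit pattern 0x99 prints as -103.
        System.out.println(Arrays.toString(bytes)); // [-103, -103]
        // Masking with 0xFF recovers the unsigned value 0x99.
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // 99
    }
}
```

Both bytes of this particular character are 0x99, so UTF_16LE and UTF_16BE happen to give the same array here.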

Upvotes: 1

p e p

Reputation: 6674

For Unicode characters, you can specify the encoding in the call to getBytes:

byte[] arr = S.getBytes("UTF-8");
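As a sketch of the same idea, the overload that takes a Charset avoids the checked UnsupportedEncodingException that the string-name overload declares. The class name Utf8BytesDemo and the helper utf8Bytes are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8BytesDemo {
    // Hypothetical helper: encode as UTF-8 regardless of the platform default.
    public static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+9999 needs three bytes in UTF-8: 0xE9 0xA6 0x99.
        System.out.println(Arrays.toString(utf8Bytes("\u9999"))); // [-23, -90, -103]
    }
}
```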

As for why you are getting 63: calling getBytes without a parameter uses your platform's default encoding. The character \u9999 cannot be represented in that encoding, so it is replaced with ?, which has the decimal value 63 in ASCII.
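The replacement can be reproduced on any machine by picking a charset that cannot represent the character. This sketch assumes US-ASCII as a stand-in for the asker's unknown default charset; ReplacementDemo and asciiBytes are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    // Hypothetical helper: encode with US-ASCII, which cannot represent U+9999.
    public static byte[] asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The charset that the no-argument getBytes() would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        // The unmappable character is replaced by the charset's replacement byte, '?'.
        System.out.println(Arrays.toString(asciiBytes("\u9999"))); // [63]
        System.out.println((byte) '?'); // 63
    }
}
```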

Upvotes: 6
