Jonathan Pitre

Reputation: 2425

Java: Is the value of a String's characters (in bytes) constant?

TL;DR: In Java, will casting a character obtained from a String via the charAt method to a byte always yield the same value?

I am reading files which are encoded with arbitrary (unknown to us) character encodings. I need to parse these files and look for certain words, e.g. "TAG". I placed certain restrictions on the file contents, such as "when looking for a tag, the bytes for "TAG" must be the same as their ASCII representation".

For example, suppose I have the following file:
0x00 0x11 0x22 0x33 0x54 0x41 0x47 0x77 0x88 0x99 0xaa 0xbb
Since the ASCII values for T, A and G are respectively 0x54, 0x41 and 0x47, I can find "TAG" in the file by parsing the bytes themselves.
0x00 0x11 0x22 0x33 [0x54 0x41 0x47] 0x77 0x88 0x99 0xaa 0xbb

However, I need to hard-code the value of the bytes I am looking for. To do this, I call String's charAt(int i) method and cast the char to a byte.

Here is, for example, how I would verify an arbitrary byte (called b) for the byte representation of 'T':
String tag = "TAG";
char t = tag.charAt(0);
if ((byte)t == b){
        //magic goes here, such as comparing the 'A' and the 'G'
}
Note: the code is not actually like that, and the verification algorithm is much more elegant.

This works fine on my local machine. However, this will be run on machines which may use very strange encodings. What worries me is whether casting a character obtained with charAt to a byte might yield a different value depending on the machine. I know that Java always represents chars as UTF-16, but I am worried that converting from a String to a char and then to a byte might yield strange results.

So, in short, will casting a character obtained from a String via the charAt method to a byte always yield the same value? Or will it depend on an external factor?

Thanks for your help!

Note: I cannot hard-code the bytes themselves (in, for example, a byte array) since they can be very very long and may be changed very often in the future.

Upvotes: 1

Views: 1170

Answers (4)

Tamoghna Chowdhury

Reputation: 208

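Instead of typecasting them directly, you could use the String.codePointAt(int index) method, which returns the code point as an int (Character has no codePointAt(char c) overload; its codePointAt methods take a CharSequence or char[] plus an index). This should guarantee you the same result every time.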

Upvotes: 0

keiki

Reputation: 3459

Yes. charAt(int) returns a Java char, which is defined as a UTF-16 code unit, so casting it to a byte always yields the same value.

In contrast, String.getBytes() returns bytes that depend on the charset: the one you specify, or the platform's default charset if none is specified.
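To make the difference concrete, here is a minimal sketch comparing the cast with getBytes using explicitly named charsets. The class name CharsetDemo and the helper castBytes are made up for this illustration; they are not from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original question.
final class CharsetDemo {

    // Casts each char to a byte, keeping only the low 8 bits of the UTF-16 code unit.
    static byte[] castBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String tag = "TAG";
        byte[] viaCast = castBytes(tag);
        // Naming the charset removes any dependency on the platform default.
        byte[] viaAscii = tag.getBytes(StandardCharsets.US_ASCII);
        byte[] viaUtf16 = tag.getBytes(StandardCharsets.UTF_16BE);

        System.out.println(Arrays.toString(viaCast));          // [84, 65, 71]
        System.out.println(Arrays.equals(viaCast, viaAscii));  // true
        System.out.println(viaUtf16.length);                   // 6 (two bytes per char)
    }
}
```

For pure ASCII text the cast and getBytes(StandardCharsets.US_ASCII) agree; only the no-argument getBytes() depends on the machine.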

Upvotes: 2

Hans Z

Reputation: 4744

java.lang.String.charAt will always return a 16-bit UTF-16 code unit, which will always be the same when you cast it to a byte. However, because char is a 16-bit unsigned data type, casting it to an 8-bit signed byte discards the high 8 bits and can produce a negative value, which might give you unwanted behavior. If your source data is ASCII, you will get exactly the behavior you expect.
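A small sketch of what the narrowing cast does for ASCII and non-ASCII chars. The class name CastDemo and the helper low are invented for this example; the characters are written as Unicode escapes so the source file's own encoding does not matter.

```java
// Hypothetical demo class illustrating the char-to-byte narrowing cast.
final class CastDemo {

    // Keeps the low 8 bits of the char and reinterprets them as a signed byte.
    static byte low(char c) {
        return (byte) c;
    }

    public static void main(String[] args) {
        System.out.println(low('T'));             // 84  (0x54, plain ASCII)
        System.out.println(low('\u00E9'));        // -23 (0xE9 read as a signed byte)
        System.out.println(low('\u00E9') & 0xFF); // 233 (masking recovers the unsigned value)
        System.out.println(low('\u20AC'));        // -84 (high byte 0x20 is dropped, 0xAC remains)
        // A non-ASCII char whose low byte happens to be 0x54 collides with 'T'.
        System.out.println(low('\u0154') == low('T')); // true
    }
}
```

The last line shows why the cast is only safe when the tag is known to contain ASCII characters: distinct chars can map to the same byte.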

Upvotes: 3

Peter Lawrey

Reputation: 533730

Conversion of a char to a byte with (byte) will give you the same result on all systems.

However, it is very rare that you need to mix char and byte. You should really use one or the other. Mixing the concepts can lead to confusion, as you suspect.
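One way to stay in bytes throughout, sketched under the assumption that the tag is pure ASCII: convert the tag to a byte[] once with an explicit charset, then search the file's bytes directly. The class TagSearch and its indexOf helper are hypothetical names for this illustration, and the data array is the sample file from the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing a byte-only search for an ASCII tag.
final class TagSearch {

    // Returns the index of the first occurrence of pattern in data, or -1 if absent.
    static int indexOf(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return 0;
        }
        outer:
        for (int i = 0; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample file contents from the question.
        byte[] data = {
            0x00, 0x11, 0x22, 0x33, 0x54, 0x41, 0x47, 0x77,
            (byte) 0x88, (byte) 0x99, (byte) 0xaa, (byte) 0xbb
        };
        // The tag stays a String in the source, so it remains easy to change.
        byte[] tag = "TAG".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOf(data, tag)); // 4
    }
}
```

This keeps the tags as ordinary string literals while the comparison itself never touches char.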

Upvotes: 0
