Reputation: 2425
TL;DR: In Java, will casting a character obtained from a String via the charAt method to a byte always yield the same value?
I am reading files which are encoded with arbitrary (unknown to us) character encodings. I need to parse these files and look for certain words, e.g. "TAG". I placed certain restrictions on the file contents, such as "when looking for a tag, the bytes for "TAG" must be the same as their ASCII representation".
For example, suppose I have the following file:
0x00 0x11 0x22 0x33 0x54 0x41 0x47 0x77 0x88 0x99 0xaa 0xbb
Since the ASCII values for T, A and G are respectively 0x54, 0x41 and 0x47, I can find "TAG" in the file by parsing the bytes themselves.
0x00 0x11 0x22 0x33
0x54 0x41 0x47
0x77 0x88 0x99 0xaa 0xbb
However, I need to hard-code the value of the bytes I am looking for. To do this, I call String's charAt(int i) method and cast the char to a byte.
Here is, for example, how I would check whether an arbitrary byte (called b) matches the byte representation of 'T':
String tag = "TAG";
char t = tag.charAt(0);
if ((byte) t == b) {
    // magic goes here, such as comparing the 'A' and the 'G'
}
Note: the code is not actually like that, and the verification algorithm is much more elegant.
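For reference, the kind of check described above can be sketched as a simple scan over the raw bytes. This is only an illustration, not the actual algorithm: the class name TagFinder and the method indexOf are hypothetical, and the tag is assumed to contain ASCII characters only.

```java
public class TagFinder {
    // Returns the index of the first occurrence of tag in data, or -1 if absent.
    // Each char of the tag is narrowed to a byte, which is only meaningful
    // when the tag is pure ASCII (an assumption of this sketch).
    static int indexOf(byte[] data, String tag) {
        for (int i = 0; i + tag.length() <= data.length; i++) {
            int j = 0;
            while (j < tag.length() && data[i + j] == (byte) tag.charAt(j)) {
                j++;
            }
            if (j == tag.length()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The sample file from above.
        byte[] file = {
            0x00, 0x11, 0x22, 0x33, 0x54, 0x41, 0x47,
            0x77, (byte) 0x88, (byte) 0x99, (byte) 0xaa, (byte) 0xbb
        };
        System.out.println(indexOf(file, "TAG")); // prints 4
    }
}
```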
This works fine on my local machine. However, this will be run on machines which may use very strange encodings. What worries me is whether casting a character obtained with charAt to a byte might yield a different value depending on the machine. I know that Java always encodes chars as UTF-16, but I am worried that converting from a String to a char and then to a byte might yield strange results.
So, in short, will casting a character obtained from a String via the charAt method to a byte always yield the same value? Or will it depend on an external factor?
Thanks for your help!
Note: I cannot hard-code the bytes themselves (in, for example, a byte array) since they can be very very long and may be changed very often in the future.
Upvotes: 1
Views: 1170
Reputation: 208
Instead of typecasting them directly, you could use the String.codePointAt(int index) method (or Character.codePointAt(CharSequence seq, int index)). This should guarantee you the same result every time.
Upvotes: 0
Reputation: 3459
Yes, charAt(int) returns a Java-defined char (a UTF-16 code unit), so the result of casting it to byte is always the same.
In contrast, String.getBytes() returns bytes that depend on the specified charset or, if none is specified, on the platform's default charset.
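If you prefer getBytes over casting, passing an explicit charset removes the dependency on the machine. A minimal sketch, assuming the tags are pure ASCII; the class name TagBytes and the helper asciiBytes are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class TagBytes {
    // Encodes the tag with an explicit charset, so the platform
    // default charset is never consulted.
    static byte[] asciiBytes(String tag) {
        return tag.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        for (byte b : asciiBytes("TAG")) {
            System.out.printf("0x%02x ", b);
        }
        System.out.println(); // prints 0x54 0x41 0x47
    }
}
```

The resulting array can be computed once per tag and compared against the file bytes directly.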
Upvotes: 2
Reputation: 4744
java.lang.String.charAt will always return a 16-bit UTF-16 code unit, which will always give the same value when you cast it to a byte. However, because char is a 16-bit unsigned data type, casting it to an 8-bit signed byte discards the high 8 bits and can produce negative values, which might give you unwanted behavior. If your source data is ASCII, you will get exactly the behavior you expect.
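To make the narrowing concrete, here is a small sketch of what the cast does for an ASCII character and for two non-ASCII ones. The class name CharToByte and the helpers narrow and isAscii are hypothetical; isAscii is just one possible guard you could run on each tag before trusting the cast.

```java
public class CharToByte {
    // The cast keeps only the low 8 bits of the 16-bit char.
    static byte narrow(char c) {
        return (byte) c;
    }

    // True if every char is in the 7-bit ASCII range, i.e. the cast is lossless.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(narrow('T'));      // 84 (0x54), unchanged
        System.out.println(narrow('\u00e9')); // -23: 0xE9 reinterpreted as signed
        System.out.println(narrow('\u20ac')); // -84: high byte 0x20 dropped, 0xAC kept
        System.out.println(isAscii("TAG"));   // true
    }
}
```

The results are the same on every machine, but only the ASCII case preserves the character's value.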
Upvotes: 3
Reputation: 533730
Conversion of a char to a byte with (byte) will give you the same result on all systems.
However, it is very rare that you need to mix char and byte. You should really use one or the other. Mixing the concepts can lead to confusion, as you suspect.
Upvotes: 0