Confluence

Reputation: 1341

How can I safely use a Java byte as an unsigned char?

I am porting some C code that uses a lot of bit manipulation into Java. The C code operates under the assumption that int is 32 bits wide and char is 8 bits wide. There are assertions in it that check whether those assumptions are valid.

I have already come to terms with the fact that I'll have to use long in place of unsigned int. But can I safely use byte as a replacement for unsigned char?

They merely represent bytes, but I have already run into this bizarre incident (data is an unsigned char * in C and a byte[] in Java):

/* C */
uInt32 c = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];

/* Java */
long a = ((data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]) & 0xffffffff;
long b = ((data[0] & 0xff) << 24) | ((data[1] & 0xff) << 16) |
          ((data[2] & 0xff) << 8) | (data[3] & 0xff) & 0xffffffff;

You would think a left shift operation is safe. But due to the strange unary promotion rules in Java, a and b are not going to be the same if some of the bytes in data are "negative" (b gives the correct result).

What other "gotchas" should I be aware of? I really don't want to use short here.

Upvotes: 5

Views: 2547

Answers (2)

autistic

Reputation: 15632

... can I safely use byte as a replacement for unsigned char?

As you've discovered, not really... No.

According to the Oracle Java documentation, byte is a signed integer type. It has 256 distinct values, but its range is explicitly specified ("It has a minimum value of -128 and a maximum value of 127 (inclusive)"), so there are values that a C unsigned char can store and a Java byte can't (and vice versa).

That explains the problem you've experienced. However, the extent of the problem hasn't been fully demonstrated on your 8-bit-byte implementation.


What other "gotchas" should I be aware of?

Whilst a byte in Java is required to support only values between (and including) -128 and 127, C's unsigned char has a maximum value (UCHAR_MAX) that depends upon the number of bits used to represent it (CHAR_BIT; at least 8). So when CHAR_BIT is greater than 8, there will be extra values beyond 255 that unsigned char can store.


In summary, in the world of Java a byte should really be called an octet (a group of eight bits), whereas in C a byte (char, signed char, unsigned char) is a group of at least (possibly more than) eight bits.

No. They are not equivalent. I don't think you'll find an equivalent type in Java, either; they're all rather fixed-width. You could safely use byte in Java as an equivalent for int8_t in C, however (except that int8_t isn't required to exist in C unless CHAR_BIT == 8).
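To make the int8_t comparison concrete, here is a minimal sketch of what happens when a value in the 0..255 range is pushed through a Java byte. The class name and the store/load helpers are made up for illustration; they are not from the question's code.

```java
final class ByteRoundTripDemo {
    // Hypothetical helper: the narrowing cast keeps only the low 8 bits.
    static byte store(int value) {
        return (byte) value;
    }

    // Hypothetical helper: masking recovers the 0..255 reading of those bits.
    static int load(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = store(200);
        System.out.println(b);              // -56: the same bit pattern, read as signed
        System.out.println(load(b));        // 200
        System.out.println(b == 200);       // false: b is promoted to the int -56
        System.out.println(load(b) == 200); // true
    }
}
```

The bit pattern survives the round trip, so byte[] works as raw storage; it is the reads (comparisons, arithmetic, shifts) that need the mask.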


As for pitfalls, there are some in your C code too. Assuming data[0] is an unsigned char, data[0] << 24 is undefined behaviour on any system for which INT_MAX == 32767.

Upvotes: -1

You can safely use a byte to represent a value between 0 and 255 if you make sure to bitwise-AND its value with 255 (or 0xFF) before using it in computations. This promotes it to an int, and ensures the promoted value is between 0 and 255.

Otherwise, integer promotion would result in an int value between -128 and 127, using sign extension. -127 as a byte (hex 0x81) would become -127 as an int (hex 0xFFFFFF81).
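A small sketch of that sign extension, and of how it leaks into a shift. The class and method names are just for illustration.

```java
final class SignExtensionDemo {
    // Shift without masking: b is promoted to int with sign extension first.
    static int shiftUnmasked(byte b, int n) {
        return b << n;
    }

    // Shift after masking: the promoted value is forced into 0..255 first.
    static int shiftMasked(byte b, int n) {
        return (b & 0xFF) << n;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x81;
        System.out.println(Integer.toHexString(b));                   // ffffff81
        System.out.println(Integer.toHexString(shiftUnmasked(b, 8))); // ffff8100
        System.out.println(Integer.toHexString(shiftMasked(b, 8)));   // 8100
    }
}
```

In the unmasked case the upper 16 bits are all ones, so OR-ing that term into a 32-bit word overwrites whatever the higher bytes contributed.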

So you can do this:

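long a = (((data[0] & 255) << 24) | ((data[1] & 255) << 16) | ((data[2] & 255) << 8) | (data[3] & 255)) & 0xffffffffL;

Note the L suffix on the final mask: without it, 0xffffffff is the int -1, so the mask changes nothing and a comes out negative whenever the top bit of data[0] is set. Also note that the first & 255 is unnecessary here, since a later step masks off the extra bits anyway (& 0xffffffffL). But it's probably simplest to just always include it.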
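If you do this in more than one place, it may be worth wrapping it in a helper. The following is only a sketch: the class and method names are invented, and it assumes big-endian byte order as in the C expression from the question. The second method is there purely to show what happens when the final mask is an int literal instead of a long literal.

```java
final class BigEndianReader {
    // Packs four bytes starting at offset into an int, most significant byte first.
    static int pack(byte[] data, int offset) {
        return ((data[offset]     & 0xFF) << 24)
             | ((data[offset + 1] & 0xFF) << 16)
             | ((data[offset + 2] & 0xFF) << 8)
             |  (data[offset + 3] & 0xFF);
    }

    // 0xFFFFFFFFL is a long literal: the packed int is widened to long
    // (with sign extension) and the upper 32 bits are then cleared.
    static long readUint32(byte[] data, int offset) {
        return pack(data, offset) & 0xFFFFFFFFL;
    }

    // Pitfall, for comparison only: 0xFFFFFFFF is the int -1, so the mask
    // is a no-op and the result stays negative when the top bit is set.
    static long readUint32IntMask(byte[] data, int offset) {
        return pack(data, offset) & 0xFFFFFFFF;
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0x81, 0x02, 0x03, 0x04 };
        System.out.println(Long.toHexString(readUint32(data, 0)));        // 81020304
        System.out.println(Long.toHexString(readUint32IntMask(data, 0))); // ffffffff81020304
    }
}
```

On Java 8 and later, Integer.toUnsignedLong(pack(data, offset)) should give the same result as the long mask, and Byte.toUnsignedInt(b) is the library equivalent of b & 0xFF.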

Upvotes: 5
