firk

Reputation: 345

Representation of ASCII codes in char on non-two's-complement platforms

It's all easy on mainstream platforms: the letter 'A' has ASCII code 65, so it is (char)65, and also (unsigned char)65 and (signed char)65, all of which lead to the same bit sequence in memory.

But as far as I know, the C standard does not require signed numbers to be encoded using any specific scheme. So it is possible that on some machine (signed char)65 and (unsigned char)65 are represented by different bit sequences (example: https://en.wikipedia.org/wiki/Offset_binary). Am I right, or is this behaviour prohibited somewhere in the standard?

If it is possible: which of them will be 'A' (for example, in some generic text file editor)? Is it somehow connected to the signedness of the plain char type?

Is there a portable way to handle such cases at all?

Another side of the same problem.

For example, I have char some_text[100]; and I want to read it as unsigned codes. There are two options:

(unsigned char)(some_text[i]) - converts the signed value to unsigned, keeping its numerical value when possible

*(unsigned char*)(some_text+i) - keeps the bit sequence, but the value may change depending on the platform

Which one is more portable and correct, considering the exotic platforms described above? A sketch of both options follows.
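To make the two options concrete, here is a minimal sketch (the array contents are chosen only for illustration):

    #include <stdio.h>

    int main(void) {
        char some_text[100] = "A";
        int i = 0;

        /* Option 1: convert the value */
        unsigned char by_value = (unsigned char)(some_text[i]);

        /* Option 2: reinterpret the stored bits */
        unsigned char by_bits = *(unsigned char *)(some_text + i);

        printf("by_value = %u, by_bits = %u\n",
               (unsigned)by_value, (unsigned)by_bits);
        return 0;
    }

On mainstream platforms both numbers come out as 65; the question is whether they can ever disagree.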

Upvotes: 0

Views: 245

Answers (2)

n. m. could be an AI

Reputation: 119877

ASCII codes are numbers 0 to 127.

The C standard requires that the representation of these numbers is the same for signed and unsigned char types.

Values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation

and

signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type

These provisions allow one to safely convert between signed and unsigned char types, and (more importantly) between arrays thereof. These conversions behave predictably and portably. When an object of type signed char is accessed via an unsigned char lvalue, and the value of the original object is non-negative (all ASCII codes are), the accessed value is guaranteed to be the same as the original value. Conversely, if an unsigned char is accessed via a signed char lvalue, and the original value fits in the signed range (all ASCII codes do), it is guaranteed to be unchanged. This is important because various APIs often use character arrays of inconvenient signedness; we want to be sure we can use such APIs with a simple cast to/from our preferred character type.
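For instance, a minimal sketch of these guarantees (the value 65 is just an arbitrary ASCII code):

    #include <assert.h>

    int main(void) {
        signed char sc = 65;                       /* non-negative, so... */
        unsigned char *pu = (unsigned char *)&sc;  /* ...accessing it as unsigned char */
        assert(*pu == 65);                         /* ...yields the same value */

        unsigned char uc = 65;                     /* fits the signed range, so... */
        signed char *ps = (signed char *)&uc;      /* ...accessing it as signed char */
        assert(*ps == 65);                         /* ...is also value-preserving */
        return 0;
    }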

What about negative values? These are not ASCII but we often work with other character sets and encodings (e.g. UTF-8) and those could have negative members.

Negative values are represented using exactly one of three possible methods:

If the sign bit is one, the value shall be modified in one of the following ways:

the corresponding value with sign bit 0 is negated (sign and magnitude);
the sign bit has the value -(2^M) (two's complement);
the sign bit has the value -(2^M - 1) (ones' complement).

Here we have a problem with negative zero in the sign-and-magnitude representation (ones' complement has a negative zero too): it cannot survive a round trip via an unsigned type. It follows that some character encodings, like UTF-8, cannot be readily supported by such an implementation. It is not a problem for ASCII though.
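For illustration, here is the bit-preserving way to read bytes of such an encoding, next to the plain value conversion (0xC3 is just an example UTF-8 lead byte):

    #include <stdio.h>

    int main(void) {
        char s[] = "\xC3\xA9";                        /* "é" in UTF-8: bytes above 127 */
        unsigned char bits = *(unsigned char *)&s[0]; /* reread the stored bits */
        unsigned char value = (unsigned char)s[0];    /* convert the value */
        printf("bits = 0x%02X, value = 0x%02X\n",
               (unsigned)bits, (unsigned)value);
        return 0;
    }

On a two's complement machine both print 0xC3; on a sign-and-magnitude implementation the two would differ, and a stored negative zero (bit pattern 0x80) would come back as 0x00 from the value conversion.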

As for other integer types, the representation is not really important here. When you use e.g. int to represent an ASCII value, you are normally interested in the value, not in the representation. You can safely convert values 0 to 127 between all integer types supported by C. (Other integer types may have padding bits, but otherwise most of the above is true about them too; this is irrelevant because normal programming is almost never affected).
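A sketch of that value-centric view:

    #include <assert.h>

    int main(void) {
        for (int v = 0; v <= 127; ++v) {
            signed char sc = (signed char)v;
            unsigned char uc = (unsigned char)v;
            long l = v;
            /* values 0..127 survive conversion between any integer types */
            assert(sc == v && uc == v && l == v);
        }
        return 0;
    }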

An exotic platform that uses a different char representation cannot support standard C, so writing portably for such platforms is not a meaningful proposition.

Finally, the same is true if you replace ASCII with whatever basic character set is actually used by the platform, except the range may be different.

Upvotes: 2

Lundin

Reputation: 213842

First of all, char itself has implementation-defined signedness, so it could be either signed or unsigned, depending on the compiler.

The value of any 7-bit character code, cast to either signed or unsigned char, is always non-negative. When speaking of ASCII, we mean the original 7-bit table only, which can never produce a negative value. Therefore the underlying representation of negative numbers is irrelevant: the sign bit never comes into play as long as the value fits in 7 bits.
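A quick sketch that makes both points visible on any conforming implementation:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        /* plain char's signedness is implementation-defined; CHAR_MIN reveals it */
        printf("plain char is %s\n", CHAR_MIN < 0 ? "signed" : "unsigned");

        /* a 7-bit code has the same value in all three char flavours */
        printf("%d %d %d\n", (char)65, (signed char)65, (unsigned char)65);
        return 0;
    }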

To summarize your questions:

So it is possible that on some machine (signed char)65 and (unsigned char)65 are represented by different bit sequences.

No.

Am I right, or is this behaviour prohibited somewhere in the standard?

Yes, C17 6.3.1.3. "When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged."

The only code that will face portability issues is code relying on character sets that need 8 or more bits per symbol. But then wchar_t is typically used instead.
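For example, a minimal sketch (how wide characters are encoded and printed is platform-specific):

    #include <wchar.h>

    int main(void) {
        wchar_t wc = L'A';   /* wide character type; its size is implementation-defined */
        wprintf(L"%lc has code %ld\n", (wint_t)wc, (long)wc);
        return 0;
    }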

Upvotes: 1
