Tau

Reputation: 632

Why is char different from *both* signed char and unsigned char?

cppreference.com states that char is

Equivalent to either signed char or unsigned char [...], but char is a distinct type, different from both signed char and unsigned char

I assume this means that a char can hold exactly the same values as either unsigned char or signed char, but is not compatible with either. Why was it decided to work this way? Why does unqualified char not denote a char of the platform-appropriate signedness, like with the other integer types, where int denotes exactly the same type as signed int?
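
For example (a minimal sketch of my understanding of "distinct"): even on an implementation where char happens to be signed, a pointer to char does not implicitly convert to a pointer to signed char, and a conforming compiler must diagnose the mismatch:

char c = 'a';
signed char   *sp = &c;  /* constraint violation: char * and signed char * are incompatible */
unsigned char *up = &c;  /* likewise, regardless of the platform's choice for plain char */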

Upvotes: 10

Views: 1277

Answers (5)

Lundin

Reputation: 213458

The reason is backwards compatibility. Here is some research regarding the history behind it. It uses only authoritative primary sources, such as the publications by Dennis M. Ritchie (the creator of C) or ISO.

In the beginning, there was only int and char. The early draft of C called "NB" for "new B" included these new types not present in the predecessors B and BCPL [Ritchie, 93]:

...it seemed that a typing scheme was necessary to cope with characters and byte addressing, and to prepare for the coming floating-point hardware

Embryonic C

NB existed so briefly that no full description of it was written. It supplied the types int and char, arrays of them, and pointers to them, declared in a style typified by

int i, j;
char c, d;

unsigned was added later [Ritchie, 93]:

During 1973-1980, the language grew a bit: the type structure gained unsigned, long...

Note that this refers to the stand-alone "type qualifier" unsigned at this point, equivalent to unsigned int.

Around this time, in 1978, the 1st edition of The C Programming Language was published [Kernighan, 78]; chapter 2.7 mentions type conversion problems related to char:

There is one subtle point about the conversion of characters to integers. The language does not specify whether variables of type char are signed or unsigned quantities. When a char is converted to an int, can it ever produce a negative integer? Unfortunately, this varies from machine to machine, reflecting differences in architecture. On some machines (PDP-11, for instance), a char whose leftmost bit is 1 will be converted to a negative integer (“sign extension”). On others, a char is promoted to an int by adding zeros at the left end, and thus is always positive.

At this point, the type promotion to int was what was described as problematic, not the signedness of char, which wasn't even specified. The above text remains mostly unchanged in the 2nd edition [Kernighan, 88].
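
A small sketch of the promotion issue K&R describe (my own example, assuming an 8-bit char):

#include <stdio.h>

int main(void)
{
    char c = '\xFF';    /* the bit pattern 1111 1111 */
    int  i = c;         /* promoted ("converted") to int */
    printf("%d\n", i);  /* -1 where plain char is signed, 255 where it is unsigned */
    return 0;
}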

However, the types themselves are described differently between editions. In the 1st edition [Kernighan, 78, 2.2], unsigned could only be applied to int and was regarded as a qualifier:

In addition, there are a number of qualifiers which can be applied to int’s: short, long, and unsigned.

Whereas the 2nd edition is in line with standard C [Kernighan, 88, 2.2]:

The qualifier signed or unsigned may be applied to char or any integer. /--/ Whether plain chars are signed or unsigned is machine-dependent, but printable characters are always positive.

So between the 1st and 2nd editions, a backwards compatibility problem had been discovered with applying the new unsigned/signed (now called type specifiers rather than qualifiers [ANSI/ISO, 90]) to the char type, with the same concerns that had already been identified regarding type conversions back in the 1st edition.

This compatibility concern remained during standardization in the late 80s. We can read this from the various rationales such as [ISO, 98, 6.1.2.5 §30]

Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types. Two varieties of the integral types are specified: signed and unsigned. If neither specifier is used, signed is assumed. In the Base Document the only unsigned type is unsigned int.

This actually suggests that signed int was allowed to make int more symmetric with char, rather than the other way around.


Sources:

[Ritchie, 93] Dennis M. Ritchie, "The Development of the C Language", 1993.
[Kernighan, 78] Brian W. Kernighan and Dennis M. Ritchie, "The C Programming Language", 1st edition, 1978.
[Kernighan, 88] Brian W. Kernighan and Dennis M. Ritchie, "The C Programming Language", 2nd edition, 1988.
[ANSI/ISO, 90] ANSI X3.159-1989 / ISO/IEC 9899:1990, "Programming Languages - C".
[ISO, 98] "Rationale for International Standard - Programming Languages - C", draft, 1998.

Upvotes: 1

Andrew Henle

Reputation: 1

The three C character types char, signed char, and unsigned char exist as codification of legacy C implementations and usage.

The X3J11 committee that codified C into the first C standard (now known as C89) stated their purpose in the Rationale (italics original):

1.1 Purpose

The Committee's overall goal was to develop a clear, consistent, and unambiguous Standard for the C programming language which codifies the common, existing definition of C and which promotes the portability of user programs across C language environments.

The X3J11 charter clearly mandates the Committee to codify common existing practice. ...

N.B.: the X3J11 committee went out of their way to emphasize they were codifying existing implementations of C and common usage/practices in order to promote portability.

In other words, "standard" C was never created - existing C code, usages, and practices were codified.

Per 3.1.2.5 Types of that same Rationale (bolding mine):

Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. ...

The words of the committee are clear: three types of char exist because plain char had to be either signed or unsigned in order to match "prior practice". Plain char therefore had to be separate - portable code could not rely on plain char being signed or unsigned, but both signed char and unsigned char had to be available.

The three character types cannot be compatible in any way because of portability concerns - and portability of standard-conforming C code was one of the X3J11 committee's main goals.

If extern char buffer[10] were compatible with unsigned char buffer[10] on a system where plain char is unsigned, the code would behave differently if it were compiled* on a system where plain char is signed and therefore incompatible with unsigned char buffer[10]. For example, bit-shifting elements of buffer would change behavior depending on whether buffer were accessed through the extern char buffer[10] declaration or through the unsigned char buffer[10] definition, breaking portability.
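
A minimal sketch of that kind of behavior change (my own illustration, not from the Rationale), assuming an 8-bit char and the usual arithmetic right shift of negative values:

#include <stdio.h>

int main(void)
{
    unsigned char u = 0x80;
    signed char   s = (signed char)0x80;  /* implementation-defined conversion, typically -128 */

    printf("%d\n", u >> 1);  /* 64: zero bits shifted in after promotion to int */
    printf("%d\n", s >> 1);  /* typically -64: promoted to a negative int first */
    return 0;
}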

The fact that char could already be either signed or unsigned, with different behavior in such a situation, was existing practice, and the committee could not change that without violating their goal to "codif[y] the common, existing definition of C".

But with a goal of promoting portability, there was no reason whatsoever to create a crazed, portability-nightmare-inducing situation where "sometimes char is compatible with this and not that, and sometimes char is compatible with that and not this".

* - If the code compiled at all - but this is a hypothetical meant to demonstrate why the three char types must be incompatible.

Upvotes: 8

klutt

Reputation: 31296

TL;DR

Backwards compatibility. Probably. Or possibly that they had to choose and didn't care. But I have no certain answer.

Long version

Intro

Just like OP, I'd prefer a certain answer from a reliable source. In the absence of that, qualified guesses and speculations are better than nothing.

Very many things in C come from backwards compatibility. When it was decided that whether char behaves as signed char or as unsigned char is implementation-defined, there was already a lot of C code out there, some of it using signed chars and some unsigned. Forcing it to be one or the other would certainly have broken some code.

Why it (probably) does not matter

Why does unqualified char not denote a char of the platform-appropriate signedness

It does not matter much. An implementation that uses signed chars guarantees that CHAR_MIN is equal to SCHAR_MIN and that CHAR_MAX is equal to SCHAR_MAX. The same goes for unsigned. So an unqualified char will always have exactly the same range as its qualified counterpart.

From the standard 5.2.4.2.1p2:

If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX.

This points us in the direction that they just didn't really care, or that it "feels safer".
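
A minimal sketch of that guarantee (my own example): the signedness of plain char can be detected from CHAR_MIN, and its range always coincides with the range of one of the two qualified types.

#include <limits.h>
#include <stdio.h>

int main(void)
{
#if CHAR_MIN == 0
    printf("plain char is unsigned: 0..%d\n", CHAR_MAX);           /* CHAR_MAX equals UCHAR_MAX */
#else
    printf("plain char is signed: %d..%d\n", CHAR_MIN, CHAR_MAX);  /* same as SCHAR_MIN..SCHAR_MAX */
#endif
    return 0;
}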

Another interesting mention in the C standard is this:

All enumerations have an underlying type. The underlying type can be explicitly specified using an enum type specifier and is its fixed underlying type. If it is not explicitly specified, the underlying type is the enumeration’s compatible type, which is either a signed or unsigned integer type (excluding the bit-precise integer types), or char.
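
For illustration (C23 syntax, my own example), an enumeration can even be given plain char as its fixed underlying type, which the wording above lists separately from the signed and unsigned integer types:

enum byte_flag : char { FLAG_OFF, FLAG_ON };  /* fixed underlying type is plain char */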

Possible problems with breaking this (speculation)

I'm trying to come up with a scenario where this would actually matter. One that could possibly cause issues is if you compile a source file to a shared library with one compiler using signed char and then use that library in a source file compiled with another compiler using unsigned char.

And even if that would not cause problems, imagine that the shared library is compiled with a pre-ANSI compiler. Well, I cannot say for certain that this would cause problems either. But I can imagine that it could.

And another speculation, from Steve Summit in the comment section:

I'm speculating, but: if the Standard had required, in Eric's phrasing, "char is the same type as an implementation-defined choice of signed char or unsigned char", then if I'm on a platform on which char is the same as signed char, I can intermix the two with no warnings, and create code that's not portable to a machine where char is unsigned by default. So the definition "char is a distinct type from signed char and unsigned char" helps force people to write portable code.

Backwards compatibility is a sacred feature

But remember that the persons behind the C standard were and are VERY concerned about not breaking backwards compatibility. Even to the point that they don't want to change the signature of some library functions to return const values because it would yield warnings. Not errors. Warnings! Warnings that you can easily disable. Instead, they just wrote in the standard that it's undefined behavior to modify the values. You can read more about that here: https://thephd.dev/your-c-compiler-and-standard-library-will-not-help-you

So whenever you encounter very strange design choices in the C standard, it's a very good bet that backwards compatibility is the reason. That's the reason why you can initialize a pointer to NULL with just 0, even on a machine where the null pointer is not address zero. And why bool is a macro for the keyword _Bool.

It's also the reason why bitwise | and & have lower precedence than ==, because there was a lot of source code (several hundred kilobytes, installed on three (3) machines :) ) including stuff like if (a==b & c==d). Dennis Ritchie admitted that he should have changed it. https://www.lysator.liu.se/c/dmr-on-or.html

So we can at least say for certain that there are design choices made with backwards compatibility in mind that have later been admitted to be mistakes by those who made them, and that we have reliable sources for this.

C++

And also remember that your source points to C++ documentation. In that language, there are reasons that don't apply to C, like overloading.

Upvotes: 7

Chris Dodd

Reputation: 126185

The line you quote actually does not come from the C standard at all, but rather from the C++ standard. The website you link to (cppreference.com) is primarily about C++, and the C material there is something of an afterthought.

The reason this is important for C++ (and not really for C) is that C++ allows overloading based on types, but you can only overload on distinct types. The fact that char must be distinct from both signed char and unsigned char means you can safely overload all three:

// 3 overloads for fn
void fn(char);
void fn(signed char);
void fn(unsigned char);

and you will not get an error about ambiguous overloading or such.

Upvotes: 0

Jonathan Leffler

Reputation: 753525

One part of the reasoning for not mandating either signed or unsigned for plain char is the EBCDIC code set used on IBM mainframes in particular.

In §6.2.5 Types ¶3, the C standard says:

An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.

Emphasis added.

Now, in EBCDIC, the lower-case letters have the code points 0x81-0x89, 0x91-0x99, 0xA2-0xA9; the upper-case letters have the code points 0xC1-0xC9, 0xD1-0xD9, 0xE2-0xE9; and the digits have the code points 0xF0-0xF9. So:

  • The alphabets are not contiguous.
  • Lower-case letters sort before upper-case letters.
  • Digits sort higher than letters.
  • And because of §6.2.5¶3, the type of plain char has to be unsigned (see the sketch below).

Each of the first three points is in contradistinction to ASCII (and ISO 8859, and ISO 10646 aka Unicode).
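
A minimal sketch of the last bullet point (my own example, assuming an EBCDIC execution character set and an 8-bit char): storing a lower-case letter in a plain char must yield a nonnegative value, which rules out a signed plain char on such a system.

#include <stdio.h>

int main(void)
{
    char c = 'a';       /* code point 0x81 in EBCDIC */
    printf("%d\n", c);  /* guaranteed nonnegative by §6.2.5¶3, so 129 rather than -127 */
    return 0;
}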

Upvotes: 4
