Reputation: 632
cppreference.com states that char is

Equivalent to either signed char or unsigned char [...], but char is a distinct type, different from both signed char and unsigned char
I assume this means that a char can hold exactly the same values as either unsigned char or signed char, but is not compatible with either. Why was it decided to work this way? Why does unqualified char not denote a char of the platform-appropriate signedness, like with the other integer types, where int denotes exactly the same type as signed int?
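For example, C11's _Generic shows the distinction directly; this minimal sketch compiles only because all three character types are distinct (otherwise two of the associations would collide):

#include <stdio.h>

#define TYPE_NAME(x) _Generic((x), \
    char: "char", \
    signed char: "signed char", \
    unsigned char: "unsigned char")

int main(void)
{
    char c = 0;
    signed char sc = 0;
    unsigned char uc = 0;

    /* Each variable selects its own association, regardless of
       whether plain char is signed or unsigned on this platform. */
    printf("%s\n", TYPE_NAME(c));   /* prints "char" */
    printf("%s\n", TYPE_NAME(sc));  /* prints "signed char" */
    printf("%s\n", TYPE_NAME(uc));  /* prints "unsigned char" */
    return 0;
}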
Upvotes: 10
Views: 1277
Reputation: 213458
The reason is backwards compatibility. Here is some research regarding the history behind it. It uses only authoritative primary sources, such as publications by Dennis M. Ritchie (the creator of C) or ISO.
In the beginning, there was only int and char. The early draft of C called "NB" for "new B" included these new types not present in the predecessors B and BCPL [Ritchie, 93]:
...it seemed that a typing scheme was necessary to cope with characters and byte addressing, and to prepare for the coming floating-point hardware
Embryonic C
NB existed so briefly that no full description of it was written. It supplied the types int and char, arrays of them, and pointers to them, declared in a style typified by int i, j; char c, d;
unsigned was added later [Ritchie, 93]:
During 1973-1980, the language grew a bit: the type structure gained unsigned, long ...
Note that this refers to the stand-alone "type qualifier" unsigned at this point, equivalent to unsigned int.
Around this time, in 1978, The C Programming Language 1st edition was published [Kernighan, 78], and chapter 2.7 mentions type conversion problems related to char:
There is one subtle point about the conversion of characters to integers. The language does not specify whether variables of type char are signed or unsigned quantities. When a char is converted to an int, can it ever produce a negative integer? Unfortunately, this varies from machine to machine, reflecting differences in architecture. On some machines (PDP-11, for instance), a char whose leftmost bit is 1 will be converted to a negative integer (“sign extension”). On others, a char is promoted to an int by adding zeros at the left end, and thus is always positive.
At this point, the type promotion to int was what was described as problematic, not the signedness of char, which wasn't even specified. The above text remains mostly unchanged in the 2nd edition [Kernighan, 88].
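The behavior K&R describe is still observable today. A minimal sketch (the output depends on whether plain char is signed on your platform):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char c;
    memset(&c, 0xFF, 1);   /* set every bit of c */
    int i = c;             /* the conversion to int is where signedness shows */

    /* Prints -1 where plain char is signed (sign extension, e.g. x86
       Linux) and 255 where it is unsigned (zero extension, e.g. ARM
       Linux or z/OS). */
    printf("%d\n", i);
    return 0;
}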
However, the types themselves are described differently between editions. In the 1st edition [Kernighan, 78, 2.2], unsigned could only be applied to int and was regarded as a qualifier:
In addition, there are a number of qualifiers which can be applied to int’s: short, long, and unsigned.
Whereas the 2nd edition is in line with standard C [Kernighan, 88, 2.2]:
The qualifier signed or unsigned may be applied to char or any integer. /--/ Whether plain chars are signed or unsigned is machine-dependent, but printable characters are always positive.
So in between the 1st and 2nd editions, they had discovered a backwards compatibility problem with applying the new unsigned/signed (now called type specifiers, not qualifiers [ANSI/ISO, 90]) to the char type, with the same concerns as were already identified regarding type conversions back in the 1st edition.
This compatibility concern remained during standardization in the late 80s. We can read this from the various rationales, such as [ISO, 98, 6.1.2.5 §30]:
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types. Two varieties of the integral types are specified: signed and unsigned. If neither specifier is used, signed is assumed. In the Base Document the only unsigned type is unsigned int.
This actually suggests that signed int was allowed to make int more symmetric with char, rather than the other way around.
Sources:
[ANSI/ISO, 90] ANSI/ISO 9899:1990, Programming languages - C
[ISO, 98] Rationale for International Standard, Programming Languages - C (1998 draft)
[Kernighan, 78] Kernighan, Ritchie, The C Programming Language, 1st edition, 1978
[Kernighan, 88] Kernighan, Ritchie, The C Programming Language, 2nd edition, 1988
[Ritchie, 93] Ritchie, The Development of the C Language, 1993
Upvotes: 1
Reputation: 1
The three C character types char, signed char, and unsigned char exist as codification of legacy C implementations and usage.
The X3J11 committee that codified C into the first C standard (now known as C89) stated their purpose in the Rationale (italics original):
1.1 Purpose
The Committee's overall goal was to develop a clear, consistent, and unambiguous Standard for the C programming language which codifies the common, existing definition of C and which promotes the portability of user programs across C language environments.
The X3J11 charter clearly mandates the Committee to codify common existing practice. ...
N.B.: the X3J11 committee went out of their way to emphasize they were codifying existing implementations of C and common usage/practices in order to promote portability.
In other words, "standard" C was never created - existing C code, usages, and practices were codified.
Per 3.1.2.5 Types of that same Rationale (bolding mine):
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. ...
The words of the committee are clear: three types of char exist because plain char had to be either signed or unsigned in order to match "prior practice". Plain char therefore had to be separate - portable code could not rely on plain char being signed or unsigned, but both signed char and unsigned char had to be available.
The three character types cannot be compatible in any way because of portability concerns - and portability of standard-conforming C code was one of the X3J11 committee's main goals.
If extern char buffer[10] were compatible with unsigned char buffer[10] on a system where plain char is unsigned, the code would behave differently if the code were compiled* on a system where plain char is signed and therefore incompatible with unsigned char buffer[10]. For example, bit shifting elements of buffer would change behavior depending on whether buffer were accessed through the extern char buffer[10] declaration or the unsigned char buffer[10]; definition, breaking portability. A sketch of that hypothetical follows.
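Here is a minimal sketch of the scenario (hypothetical file names; assumes 8-bit chars and two's complement):

/* file1.c - the definition */
unsigned char buffer[10] = { 0x80 };

/* file2.c - if this declaration were allowed to alias the
   definition above, plain char's signedness would leak into
   the result: */
extern char buffer[10];

int shifted(void)
{
    /* Where plain char is unsigned: buffer[0] is 128, and
       128 >> 1 is 64. Where plain char is signed: buffer[0]
       is -128, and (-128) >> 1 is implementation-defined
       (commonly -64). */
    return buffer[0] >> 1;
}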
The fact that char could already be signed or unsigned, with different behavior in such a situation, already existed, and the committee could not change that without violating their goal to "codif[y] the common, existing definition of C".
But with a goal of promoting portability, there was no reason whatsoever to create a crazed, portability-nightmare-inducing situation where "sometimes char is compatible with this and not that, and sometimes char is compatible with that and not this".
* - If the code compiled at all - but this is a hypothetical meant to demonstrate why the three char types must be incompatible.
Upvotes: 8
Reputation: 31296
Backwards compatibility. Probably. Or possibly that they had to choose and didn't care. But I have no certain answer.
Just like OP, I'd prefer a certain answer from a reliable source. In the absence of that, qualified guesses and speculations are better than nothing.
Very many things in C come from backwards compatibility. When it was decided that whether char would be the same as signed char or unsigned char is implementation-defined, there was already a lot of C code out there, some of which was using signed chars and some unsigned. Forcing it to be one or the other would for certain break some code.
Why does unqualified char not denote a char of the platform-appropriate signedness
It does not matter much. An implementation that is using signed chars guarantees that CHAR_MIN is equal to SCHAR_MIN and that CHAR_MAX is equal to SCHAR_MAX. The same goes for unsigned. So an unqualified char will always have the exact same range as its qualified counterpart.
From the standard 5.2.4.2.1p2:
If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX.
This points us in the direction that they just didn't really care, or that it "feels safer".
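That guarantee is easy to check with a minimal sketch using <limits.h>:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* Per 5.2.4.2.1p2, exactly one of these branches is taken. */
    if (CHAR_MIN == SCHAR_MIN && CHAR_MAX == SCHAR_MAX)
        printf("plain char has the range of signed char\n");
    else if (CHAR_MIN == 0 && CHAR_MAX == UCHAR_MAX)
        printf("plain char has the range of unsigned char\n");
    return 0;
}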
Another interesting mention in the C standard is this:
All enumerations have an underlying type. The underlying type can be explicitly specified using an enum type specifier and is its fixed underlying type. If it is not explicitly specified, the underlying type is the enumeration’s compatible type, which is either a signed or unsigned integer type (excluding the bit-precise integer types), or char.
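For reference, the fixed underlying type mentioned there uses C23 syntax. A sketch (requires a C23 compiler):

#include <stdio.h>

/* C23: explicitly pin the underlying type to plain char. */
enum flag : char { FLAG_OFF, FLAG_ON };

int main(void)
{
    enum flag f = FLAG_ON;
    printf("%zu\n", sizeof f);   /* prints 1: same size as char */
    return 0;
}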
I'm trying to come up with a scenario where this would actually matter. One that could possibly cause issues is if you compile a source file to a shared library with one compiler using signed char and then use that library in a source file compiled with another compiler using unsigned char.

And even if that would not cause problems, imagine that the shared library is compiled with a pre-ANSI compiler. Well, I cannot say for certain that this would cause problems either. But I can imagine that it could.
And another speculation from Steve Summit in the comment section:
I'm speculating, but: if the Standard had required, in Eric's phrasing, "char is the same type as an implementation-defined choice of signed char or unsigned char", then if I'm on a platform on which char is the same as signed char, I can intermix the two with no warnings, and create code that's not portable to a machine where char is unsigned by default. So the definition "char is a distinct type from signed char and unsigned char" helps force people to write portable code.
But remember that the people behind the C standard were and are VERY concerned about not breaking backwards compatibility. Even to the point that they don't want to change the signature of some library functions to return const values, because it would yield warnings. Not errors. Warnings! Warnings that you can easily disable. Instead, they just wrote in the standard that it's undefined behavior to modify the values. You can read more about that here: https://thephd.dev/your-c-compiler-and-standard-library-will-not-help-you
So whenever you encounter very strange design choices in the C standard, it's a very good bet that backwards compatibility is the reason. That's the reason why you can initialize a pointer to NULL with just 0, even on a machine where NULL is not the zero address. And why bool is a macro for the keyword _Bool.
It's also the reason why bitwise | and & have lower precedence than ==: there was a lot (several hundred kilobytes that was installed on three (3) machines :) ) of source code including stuff like if (a==b & c==d), and raising the precedence of & above == would have broken it. Dennis Ritchie admitted that he should have changed it: https://www.lysator.liu.se/c/dmr-on-or.html
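The classic pitfall that this precedence still causes today, as a minimal sketch:

#include <stdio.h>

int main(void)
{
    unsigned flags = 0x04;

    /* Because == binds tighter than &, the condition parses as
       flags & (0x04 == 0x04), i.e. 0x04 & 1, which is 0. */
    if (flags & 0x04 == 0x04)
        printf("flag set\n");
    else
        printf("flag not set\n");   /* this prints, surprisingly */
    return 0;
}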
So we can at least say for certain that there are design choices made with backwards compatibility in mind, choices that have later been admitted to be mistakes by those who made them, and that we have reliable sources for that.
And also remember that your source points to C++ material. In that language, there are reasons that don't apply to C. Like overloading.
Upvotes: 7
Reputation: 126185
The line you quote actually does not come from the C standard at all, but rather from the C++ standard. The website you link to (cppreference.com) is primarily about C++, and the C material there is something of an afterthought.
The reason this is important for C++ (and not really for C) is that C++ allows overloading based on types, but you can only overload distinct types. The fact that char must be distinct from both signed char and unsigned char means you can safely overload all three:
// 3 overloads for fn
void fn(char);
void fn(signed char);
void fn(unsigned char);
and you will not get an error about ambiguous overloading or such.
Upvotes: 0
Reputation: 753525
One part of the reasoning for not mandating either signed or unsigned for plain char is the EBCDIC code set used on IBM mainframes in particular.
In §6.2.5 Types ¶3, the C standard says:
An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
Emphasis added.
Now, in EBCDIC, the lower-case letters have the code points 0x81-0x89, 0x91-0x99, 0xA2-0xA9; the upper-case letters have the code points 0xC1-0xC9, 0xD1-0xD9, 0xE2-0xE9; and the digits have the code points 0xF0-0xF9. So:

- the lower-case letters do not occupy contiguous code points;
- the upper-case letters do not occupy contiguous code points;
- the letters and digits all have code points greater than 0x7F;
- given the quoted guarantee, char has to be unsigned.

Each of the first three points is in contradistinction to ASCII (and ISO 8859, and ISO 10646 aka Unicode).
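A quick way to see which choice an implementation made; a minimal sketch that on an 8-bit EBCDIC system must report unsigned:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* '0' is 0xF0 in EBCDIC, yet it must be nonnegative when
       stored in a plain char, so char must be unsigned there. */
    char zero = '0';
    printf("'0' = %d, char is %s\n",
           zero, CHAR_MIN < 0 ? "signed" : "unsigned");
    return 0;
}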
Upvotes: 4