Andy

Reputation: 8091

What causes a char to be signed or unsigned when using gcc?

What determines whether a char in C (using gcc) is signed or unsigned? I know that the standard doesn't dictate one over the other, and that I can check CHAR_MIN and CHAR_MAX from limits.h, but I want to know what triggers one or the other when using gcc.

If I read limits.h from libgcc-6 I see that there is a macro __CHAR_UNSIGNED__ which defines the "default" char as signed or unsigned, but I'm unsure whether this is set by the compiler at its build time.

I tried to list GCC's predefined macros with

$ gcc -dM -E -x c /dev/null | grep -i CHAR
#define __UINT_LEAST8_TYPE__ unsigned char
#define __CHAR_BIT__ 8
#define __WCHAR_MAX__ 0x7fffffff
#define __GCC_ATOMIC_CHAR_LOCK_FREE 2
#define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
#define __SCHAR_MAX__ 0x7f
#define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
#define __UINT8_TYPE__ unsigned char
#define __INT8_TYPE__ signed char
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
#define __CHAR16_TYPE__ short unsigned int
#define __INT_LEAST8_TYPE__ signed char
#define __WCHAR_TYPE__ int
#define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2
#define __SIZEOF_WCHAR_T__ 4
#define __INT_FAST8_TYPE__ signed char
#define __CHAR32_TYPE__ unsigned int
#define __UINT_FAST8_TYPE__ unsigned char

but wasn't able to find __CHAR_UNSIGNED__.

Background: I have some code that I compile on two different machines:

Desktop PC:

Raspberry Pi3:

So the only obvious difference is the CPU architecture...

Upvotes: 51

Views: 12195

Answers (7)

plugwash

Reputation: 10514

https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html says

__CHAR_UNSIGNED__

GCC defines this macro if and only if the data type char is unsigned on the target machine. It exists to cause the standard header file limits.h to work correctly. You should not use this macro yourself; instead, refer to the standard macros defined in limits.h.

So it seems the reason you did not see it in your list is that you were testing on a system where char is signed, and the macro is not defined at all on such systems. I have confirmed that it does appear in the output of cc -dM -E -x c /dev/null | grep -i CHAR on one of my ARM systems.
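For illustration, this is roughly how a limits.h-style header can key off that macro to get the right CHAR_MIN/CHAR_MAX (a simplified sketch of the idea, not the exact GCC/glibc header):

    /* Simplified sketch of what a limits.h-style header does: */
    #ifdef __CHAR_UNSIGNED__          /* predefined by GCC only on unsigned-char targets */
    # define CHAR_MIN 0
    # define CHAR_MAX UCHAR_MAX
    #else                             /* signed-char targets, e.g. x86-64 Linux */
    # define CHAR_MIN SCHAR_MIN
    # define CHAR_MAX SCHAR_MAX
    #endif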

The C standard leaves it up to the implementation; of course that doesn't really say much, because "implementation" lumps together a bunch of things: compiler, OS, CPU architecture, etc.

On Linux it depends on the CPU family. For some architectures there are, or were, good reasons for the choice. For example, early ARM had no real support for signed bytes. For others it seems to be more arbitrary, possibly copied from other operating systems that ran on the same hardware.

As far as I can tell, Windows and macOS use signed char on all target architectures (or at least all that are currently supported).

Upvotes: 0

Peter Cordes

Reputation: 365792

On x86-64 Linux at least, it's defined by the x86-64 System V psABI.

Other platforms will have similar ABI standards documents that specify the rules that let different C compilers agree with each other on calling conventions, struct layouts, and stuff like that. (See the tag wiki for links to other x86 ABI docs, or other places for other architectures. Most non-x86 architectures have only one or two standard ABIs.)

From the x86-64 SysV ABI: Figure 3.1: Scalar Types

   C            sizeof      Alignment       AMD64
                            (bytes)         Architecture

_Bool*          1             1              boolean
-----------------------------------------------------------
char            1             1              signed byte
signed char
---------------------------------------------------------
unsigned char   1             1              unsigned byte
----------------------------------------------------------
...
-----------------------------------------------------------
int             4             4              signed fourbyte
signed int
enum***
-----------------------------------------------------------
unsigned int    4             4              unsigned fourbyte
--------------------------------------------------------------
...

* This type is called bool in C++.

*** C++ and some implementations of C permit enums larger than an int. The underlying type is bumped to an unsigned int, long int or unsigned long int, in that order.


Whether char is signed or not does actually directly affect the calling convention in this case, because of a currently-undocumented requirement which clang relies on: narrow types are sign- or zero-extended to 32 bits when passed as function args, according to the callee prototype.

So for int foo(char c) { return c; }, clang will rely on the caller to have sign-extended the arg. (code + asm for this and a caller on Godbolt).
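Here is a minimal version of that function plus a caller (the caller, call_foo, is my own illustration, not taken from the ABI document):

    int foo(char c) { return c; }      /* callee: widen the char arg to int */

    int call_foo(int x)
    {
        /* On x86-64 SysV, the caller is expected to sign-extend (for signed
           char) or zero-extend (for unsigned char) the truncated byte into
           edi before the call -- this is what clang's code for foo relies on. */
        return foo((char)x);
    }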

gcc:
    movsx   eax, dil       # sign-extend low byte of first arg reg into eax
    ret

clang:
    mov     eax, edi       # copy whole 32-bit reg
    ret

Even apart from the calling convention, C compilers have to agree so they compile inline functions in a .h the same way.

If (int)(char)x behaved differently in different compilers for the same platform, they wouldn't really be compatible.

Upvotes: 7

Davislor

Reputation: 15164

One important practical note is that the type of a UTF-8 string literal, such as u8"...", is an array of char, and it must be stored in UTF-8 format. Characters in the basic set are guaranteed to be equivalent to positive integers. However,

If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

(In C++, the type of the UTF-8 string constant is const char [] and it is not specified whether characters outside the basic set have numeric representations at all.)

Therefore, if your program needs to twiddle the bits of a UTF-8 string, you would need to use unsigned char. Otherwise, any code that checks whether the bytes of a UTF-8 string are in a certain range will not be portable.

It’s better to explicitly cast to unsigned char* than to write char and expect the programmer to compile with the right settings to configure that as unsigned char. However, you might use a static_assert() to test whether the range of char includes all the numbers from 0 to 255.
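Such a check could look like this (a sketch using C11's _Static_assert; the message text is mine):

    #include <limits.h>

    /* Fails to compile unless plain char can represent every value 0..255,
       i.e. unless char is unsigned (or wider than 8 bits). */
    _Static_assert(CHAR_MIN == 0 && CHAR_MAX >= 255,
                   "this code assumes plain char can hold 0..255");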

Upvotes: 1

Jonathan Leffler

Reputation: 755006

The default depends on the platform and native codeset. For example, machines that use EBCDIC (mainframes usually) must use unsigned char (or have CHAR_BIT > 8) because the C standard requires characters in the basic codeset to be positive, and EBCDIC uses codes like 240 for digit 0. (C11 standard, §6.2.5 Types ¶2 says: An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.)

You can control which sign GCC uses with -fsigned-char or -funsigned-char options. Whether that’s a good idea is a separate discussion.
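If you want to see which default a particular GCC build (or a -fsigned-char/-funsigned-char override) gives you, a quick test program along these lines works (my own sketch):

    /* char_sign.c -- report whether plain char is signed on this build */
    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        printf("CHAR_MIN = %d, CHAR_MAX = %d, so plain char is %s\n",
               CHAR_MIN, CHAR_MAX, (CHAR_MIN < 0) ? "signed" : "unsigned");
        return 0;
    }

Compile it once with plain gcc char_sign.c and once with gcc -funsigned-char char_sign.c and compare the output.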

Upvotes: 42

msc

Reputation: 34678

The character type char may be signed or unsigned, depending on the platform and compiler.

According to this reference link:

The C and C++ standards allows the character type char to be signed or unsigned, depending on the platform and compiler.

Most systems, including x86 GNU/Linux and Microsoft Windows, use signed char, but those based on PowerPC and ARM processors typically use unsigned char.

This can lead to unexpected results when porting programs between platforms which have different defaults for the type of char.

GCC provides the options -fsigned-char and -funsigned-char to set the default type of char.
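A classic example of that kind of porting surprise (my own illustration, not from the linked manual):

    #include <stdio.h>

    int main(void)
    {
        char c = 0xFF;      /* stored value is implementation-defined if char is signed */

        if (c == 0xFF)      /* c is promoted to int before the comparison */
            printf("char is unsigned here: c compares equal to 0xFF\n");
        else
            printf("char is signed here: c was stored as -1\n");
        return 0;
    }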

Upvotes: 13

According to the C11 standard (read n1570), char can be signed or unsigned (so you actually have two flavors of C). Which one you get is implementation-specific.

Some processors and instruction set architectures or application binary interfaces favor a signed character (byte) type (e.g. because it maps nicely to some machine code instruction), others favor an unsigned one.

gcc even has -fsigned-char and -funsigned-char options, which you should almost never use (because changing the default breaks corner cases in calling conventions and ABIs) unless you recompile everything, including your C standard library.

You could use feature_test_macros(7) and <endian.h> (see endian(3)) or autoconf on Linux to detect what your system has.

In most cases, you should write portable C code, which does not depend upon those things. And you can find cross-platform libraries (e.g. glib) to help you in that.

BTW, gcc -dM -E -x c /dev/null also gives __BYTE_ORDER__ etc., and if you want an unsigned 8-bit byte you should use <stdint.h> and its uint8_t (more portable and more readable). The standard limits.h defines CHAR_MIN, SCHAR_MIN, CHAR_MAX and SCHAR_MAX (you could compare them for equality to detect signed-char implementations), etc.
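That equality comparison can even be done at preprocessing time, e.g. (a sketch; the macro name PLAIN_CHAR_IS_SIGNED is my own):

    #include <limits.h>

    #if CHAR_MIN == SCHAR_MIN    /* plain char has the range of signed char */
    # define PLAIN_CHAR_IS_SIGNED 1
    #else                        /* CHAR_MIN is 0, so plain char is unsigned */
    # define PLAIN_CHAR_IS_SIGNED 0
    #endif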

BTW, you should care about character encoding, but most systems today use UTF-8 everywhere. Libraries like libunistring are helpful. See also this, and remember that, practically speaking, a Unicode character encoded in UTF-8 can span several bytes (i.e. several chars).

Upvotes: 53

n. m. could be an AI

Reputation: 120079

gcc has two compile-time options that control the behaviour of char:

-funsigned-char
-fsigned-char

It is not recommended to use either of these options unless you know exactly what you are doing.

The default is platform-dependent and is fixed when gcc itself is built. It is chosen for best compatibility with other tools that exist on that platform.

Source.

Upvotes: 6
