jww
jww

Reputation: 102296

Does C and C++ guarantee the ASCII of [a-f] and [A-F] characters?

I'm looking at the following code to test for a hexadecimal digit and convert it to an integer. The code is kind of clever in that it takes advantage of difference between between capital and lower letters is 32, and that's bit 5. So the code performs one extra OR, but saves one JMP and two CMPs.

static const int BIT_FIVE = (1 << 5);
static const char str[] = "0123456789ABCDEFabcdef";

for (unsigned int i = 0; i < COUNTOF(str); i++)
{
    int digit, ch = str[i];

    if (ch >= '0' && ch <= '9')
        digit = ch - '0';
    else if ((ch |= BIT_FIVE) >= 'a' && ch <= 'f')
        digit = ch - 'a' + 10;
    ...
}

Do C and C++ guarantee the ASCII or values of [a-f] and [A-F] characters? Here, guarantee means the upper and lower character sets will always differ by a constant value that can be represented by a bit (for the trick above). If not, what does the standard say about them?

(Sorry for the C and C++ tag. I'm interested in both language's position on the subject).

Upvotes: 28

Views: 4665

Answers (3)

Keith Thompson
Keith Thompson

Reputation: 263317

No, it does not.

The C standard guarantees that the decimal digits and uppercase and lowercase letters exist, along with a number of other characters. It also guarantees that the decimal digits are contiguous, for example '0' + 9 == '9', and that all members of the basic execution character set have non-negative values. It specifically does not guarantee that the letters are contiguous. (For all the gory details, see the N1570 draft of the C standard, section 5.2.1; the guarantee that basic characters are non-negative is in 6.2.5p3, in the discussion of type char.)

The assumption that 'a' .. 'f' and 'A' .. 'F' have contiguous codes is almost certainly a reasonable one. In ASCII and all ASCII-based character sets, the 26 lowercase letters are contiguous, as are the 26 uppercase letters. Even in EBCDIC, the only significant rival to ASCII, the alphabet as a whole is not contiguous, but the letters 'a' ..'f' and 'A' .. 'F' are (EBCDIC has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J', and between 'R' and 'S').

There is a proposal, N3192 to update the C standard to guarantee that the numeric values of 'a' ..'f' and 'A' .. 'F', respectively, must be contiguous. This proposal is dated 2023-11-30. If it's accepted, it won't be added to the C standard until about 2026.

However, the assumption that setting bit 5 of the representation will convert uppercase letters to lowercase is not valid for EBCDIC. In ASCII, the codes for the lowercase and uppercase letters differ by 32; in EBCDIC they differ by 64.

This kind of bit-twiddling to save an instruction or two might be reasonable in code that's part of the standard library or that's known to be performance-critical. The implicit assumption of an ASCII-based character set should IMHO at least be made explicit by a comment. A 256-element static lookup table would probably be even faster at the expense of a tiny amount of extra storage.

Upvotes: 39

Dietrich Epp
Dietrich Epp

Reputation: 213368

There are no guarantees about the particular values but you shouldn't care, because your software will probably never encounter a system which is not compatible in this way with ASCII. Assume that space is always 32 and that A is always 65, this works fine in the modern world.

The C standard only guarantees that letters A-Z and a-z exist and that they fit within a single byte.

It does guarantee that 0-9 are sequential.

In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.

Justification

There are a lot of character encodings out in the world. If you care about portability, you can either make your program portable to different character sets, or you can choose one character set to use everywhere (e.g. Unicode). I'll go ahead and loosely categorize most existing character encodings for you:

  1. Single byte character encodings compatible with ISO/IEC 646. Digits 0-9 and letters A-Z and a-z always occupy the same positions.

  2. Multibyte character encodings (Big5, Shift JIS, ISO 2022-based). In these encodings, your program is probably already broken and you'll need to spend time fixing it if you care. However, parsing numbers will still work as expected.

  3. Unicode encodings. Digits 0-9 and letters A-Z, a-z always occupy the same positions. You can either work with code points or code units freely and you will get the same result, if you are working with code points below 128 (which you are). (Are you working with UTF-7? No, you should only use that for email.

  4. EBCDIC. Digits and letters are assigned different values than their values in ASCII, however, 0-9 and A-F, a-f are still contiguous. Even then, the chance that your code will run on an EBCDIC system is essentially zero.

So the question here is: Do you think that a hypothetical fifth option will be invented in the future, somehow less compatible / more difficult to use than Unicode?

Do you care about EBCDIC?

We could dream up bizarre systems all day... suppose CHAR_BIT is 11, or sizeof(long) = 100, or suppose we use one's complement arithmetic, or malloc() always returns NULL, or suppose the pixels on your monitor are arranged in a hexagonal grid. Suppose your floating-point numbers aren't IEEE 754, suppose all of your data pointers are different sizes. At the end of the day, this does not get us closer to our goals of writing working software on actual modern systems (with the occasional exception).

Upvotes: 27

fredoverflow
fredoverflow

Reputation: 263148

For maximum portability, clarity and speed, I would suggest a simple switch:

int hex_digit_value(char x)
{
    switch (x)
    {
    case '0': return 0;
    case '1': return 1;
    case '2': return 2;
    case '3': return 3;
    case '4': return 4;
    case '5': return 5;
    case '6': return 6;
    case '7': return 7;
    case '8': return 8;
    case '9': return 9;
    case 'A':
    case 'a': return 10;
    case 'B':
    case 'b': return 11;
    case 'C':
    case 'c': return 12;
    case 'D':
    case 'd': return 13;
    case 'E':
    case 'e': return 14;
    case 'F':
    case 'f': return 15;
    default: return -1;
    }
}

clang -O1 -S transforms this into a simple table lookup:

    addl    $-48, %edi
    cmpl    $54, %edi
    ja  .LBB0_2

    movslq  %edi, %rax
    movl    .Lswitch.table(,%rax,4), %eax
    retq
.LBB0_2:
    movl    $-1, %eax
    retq

For completeness, here is the generated lookup table:

.Lswitch.table:
.long   0                       # 0x0
.long   1                       # 0x1
.long   2                       # 0x2
.long   3                       # 0x3
.long   4                       # 0x4
.long   5                       # 0x5
.long   6                       # 0x6
.long   7                       # 0x7
.long   8                       # 0x8
.long   9                       # 0x9
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   10                      # 0xa
.long   11                      # 0xb
.long   12                      # 0xc
.long   13                      # 0xd
.long   14                      # 0xe
.long   15                      # 0xf
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   10                      # 0xa
.long   11                      # 0xb
.long   12                      # 0xc
.long   13                      # 0xd
.long   14                      # 0xe
.long   15                      # 0xf

Upvotes: 24

Related Questions