obskyr

Reputation: 1459

Is C implicitly and weirdly casting this char in an array to an int?

I've got a function which is supposed to read a big-endian short from a char array. This is what it looks like:

unsigned short getShort(char* arr, int index)
{
    unsigned short n = 0;
    int i;
    for (i = 0; i <= 1; i++)
    {
        n <<= 8;
        n |= arr[index + i];
    }
    return n;
}

Instead of working as it should, however, everything but the least significant byte of the result (that is, the most significant byte in this case) gets turned into 0xFF. If I insert printf("%x\n", arr[index + i]); at the beginning of the for loop (and print a separator after the loop), I get this output:

ffffffaa
ffffff88
---
0
8
---
0
0
---
0
0
---
...
---
ffffffb9
ffffffe8
---
0
e
---
0
e
---
...

Some bytes are just padded with 0xFF, bringing them up to 32 bits. The first two bytes are supposed to be 0xAA and 0x88, and the second strange pair 0xB9 and 0xE8, but apparently they don't turn out that way. In fact, examining n every step of the way, it definitely gets ORed with the 32-bit number instead of the 8-bit char.

The weirdest part is sizeof(arr[index + i]) still returns 1, and switching out n |= arr[index + i]; for n |= (char) arr[index + i]; has the same result. What does get me the correct values is switching it for n |= arr[index + i] & 0xFF;, but... it should already be 8 bits, right?

So what the heck is happening here?

Upvotes: 2

Views: 111

Answers (2)

Jonathan Leffler

Reputation: 753805

Plain char can be a signed or unsigned type; on your machine, it appears to be signed. When a signed value with the high bit set is converted to an int, it is converted to a negative int. That's why you get the result you see.

When the value arr[index + i] is passed to printf(), it is converted to an int because that is how small types are handled when passed to variadic functions like printf() — char and short are converted to int, and float is converted to double.
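A minimal sketch of that promotion, assuming a 32-bit int and 8-bit bytes. The helper names format_as_promoted and format_as_byte are made up for illustration, and signed char is used explicitly so the result does not depend on whether plain char is signed on a given compiler:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one byte the way the question's printf did.
   The explicit (unsigned int) cast is there because %x formally expects
   an unsigned int argument. */
static void format_as_promoted(signed char c, char *buf, size_t size)
{
    int promoted = c;  /* integer promotion keeps the value, so it stays negative */
    snprintf(buf, size, "%x", (unsigned int)promoted);
}

/* Hypothetical helper: the same byte, converted to unsigned char before
   the promotion, so the promoted value is in the range 0..255. */
static void format_as_byte(signed char c, char *buf, size_t size)
{
    snprintf(buf, size, "%x", (unsigned int)(unsigned char)c);
}
```

Under those assumptions, a byte holding -86 (bit pattern 0xAA) formats as ffffffaa through the first helper and as aa through the second.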

There are also problems in the function. You should use one of:

unsigned short getShort(char* arr, int index)
{
    unsigned short n = 0;
    int i;
    for (i = 0; i <= 1; i++)
    {
        n <<= 8;
        n |= (unsigned char)arr[index + i];
    }
    return n;
}

or:

unsigned short getShort(char* arr, int index)
{
    unsigned short n = 0;
    int i;
    for (i = 0; i <= 1; i++)
    {
        n <<= 8;
        n |= arr[index + i] & 0xFF;
    }
    return n;
}

Though frankly, the loop is a bit of overkill; you could use:

unsigned short getShort(char* arr, int index)
{
    /* Mask before shifting: left-shifting a negative int is undefined. */
    return ((arr[index + 0] & 0xFF) << 8) | (arr[index + 1] & 0xFF);
}

and if you have a C99 compiler, you could even add the inline function specifier which might give you the benefits of macro-like behaviour with the safety of a true function:

static inline unsigned short getShort(char* arr, int index)
{
    /* Mask before shifting: left-shifting a negative int is undefined. */
    return ((arr[index + 0] & 0xFF) << 8) | (arr[index + 1] & 0xFF);
}

There's a moderate chance that the compiler's optimizer would produce code more or less equivalent to the functions with just a return statement even if you left the code written as a loop. If you need to have similar functions for 4-byte and 8-byte integers, keeping the loop might be better for consistency.
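As a rough sketch of what keeping the loop could look like for wider integers, the following generalises it over a byte count. The names read_be, read_be16 and read_be32 are hypothetical, and the sketch takes unsigned char pointers and fixed-width types from <stdint.h> rather than the plain char and unsigned short used in the question:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generalisation of getShort(): read nbytes (1 to 8) bytes
   starting at p, most significant byte first. Assumes 8-bit bytes. */
static uint64_t read_be(const unsigned char *p, size_t nbytes)
{
    uint64_t n = 0;
    for (size_t i = 0; i < nbytes; i++)
    {
        /* p[i] is unsigned char, so its promoted value is never negative. */
        n = (n << 8) | p[i];
    }
    return n;
}

/* Thin wrappers for the common widths. */
static uint16_t read_be16(const unsigned char *p) { return (uint16_t)read_be(p, 2); }
static uint32_t read_be32(const unsigned char *p) { return (uint32_t)read_be(p, 4); }
```

Taking unsigned char avoids the sign-extension problem at the source, at the cost of a cast at the call site if the buffer is declared as plain char.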

Note that I am making assumptions such as sizeof(short) == 2 and CHAR_BIT == 8. These are not guaranteed by the C standard, but they are the commonest configuration on desktop and server machines.
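If you would rather have those assumptions checked than merely stated, a C11 compiler can do so at compile time. This is only a sketch; the message strings are illustrative:

```c
#include <limits.h>

/* C11 compile-time checks: compilation fails on a platform where
   the assumptions behind getShort() do not hold. */
_Static_assert(CHAR_BIT == 8, "getShort() assumes 8-bit bytes");
_Static_assert(sizeof(unsigned short) == 2, "getShort() assumes a 2-byte short");
```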


But…

obskyr asks:

This makes … no sense. It isn't converted to a negative int, it's implicitly converted to an int with all the 24 high bits set to 1. Which isn't the same number. And it's not only in the printf, but in n |= arr[index + i] too. Why is this, and why does it not convert to the actual number?

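There are several misconceptions here. First, I said 'a negative int'; I did not say 'the unsigned value, negated'. For example, a signed char holding the bit pattern 0xFF has the value -1, and it is still -1 after conversion to int; the same bit pattern read as an unsigned char is 255. The converted value is -1, not -255.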

The 'why' is because the C standard says that's what should happen. I've omitted a section which describes 'ranks', but generally speaking, shorter types have a lower rank than longer types.

ISO/IEC 9899:2011

The current C standard, C11, says what follows, but the earlier version said very much the same thing in much the same words:

§6.3 Conversions

§6.3.1 Arithmetic operands

§6.3.1.1 Boolean, characters and integers


¶2 The following may be used in an expression wherever an int or unsigned int may be used:

  • An object or expression with an integer type (other than int or unsigned int) whose integer conversion rank is less than or equal to the rank of int and unsigned int.
  • A bit-field of type _Bool, int, signed int, or unsigned int.

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions.58) All other types are unchanged by the integer promotions.

¶3 The integer promotions preserve value including sign. As discussed earlier, whether a ‘‘plain’’ char is treated as signed is implementation-defined.

58) The integer promotions are applied only: as part of the usual arithmetic conversions, to certain argument expressions, to the operands of the unary +, -, and ~ operators, and to both operands of the shift operators, as specified by their respective subclauses.

 

§6.3.1.8 Usual arithmetic conversions

¶1 Many operators that expect operands of arithmetic type cause conversions and yield result types in a similar way. The purpose is to determine a common real type for the operands and result. For the specified operands, each operand is converted, without change of type domain, to a type whose corresponding real type is the common real type. Unless explicitly stated otherwise, the common real type is also the corresponding real type of the result, whose type domain is the type domain of the operands if they are the same, and complex otherwise. This pattern is called the usual arithmetic conversions:

and this material is followed by a list of rules, moving on to:

Otherwise, the integer promotions are performed on both operands. Then the following rules are applied to the promoted operands:


So, in the context of the expression:

n |= arr[index + i];

This is equivalent to:

n = n | arr[index + i];

And in this context, the value n on the RHS is promoted to int, and the value of arr[index + i] is promoted to int, and the | operation works on two int values, and the result is then converted to unsigned short, which is the type of n.

§6.5.12 Bitwise inclusive OR operator

Constraints
¶2 Each of the operands shall have integer type.
Semantics
¶3 The usual arithmetic conversions are performed on the operands.
¶4 The result of the | operator is the bitwise inclusive OR of the operands (that is, each bit in the result is set if and only if at least one of the corresponding bits in the converted operands is set).

(Note that 'integer type' is not the same as 'type int'.)
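To make the effect on n |= arr[index + i] concrete, here is a sketch that runs the question's loop next to a corrected one. It assumes a 32-bit int, a 16-bit unsigned short and two's complement; the function names are made up, and signed char is used explicitly so the outcome does not depend on the signedness of plain char:

```c
/* Hypothetical name: the question's loop, with the promotion left as is. */
static unsigned short get_short_sign_extended(const signed char *arr, int index)
{
    unsigned short n = 0;
    for (int i = 0; i <= 1; i++)
    {
        n <<= 8;
        /* arr[index + i] is promoted to int first. A negative byte such as
           -86 (0xAA) becomes 0xFFFFFFAA, so the OR sets every higher bit,
           and the conversion back to unsigned short keeps the low 16 bits. */
        n |= arr[index + i];
    }
    return n;
}

/* Hypothetical name: the same loop with the byte made non-negative first. */
static unsigned short get_short_masked(const signed char *arr, int index)
{
    unsigned short n = 0;
    for (int i = 0; i <= 1; i++)
    {
        n <<= 8;
        n |= (unsigned char)arr[index + i];  /* 0xAA promotes to the int 170 */
    }
    return n;
}
```

With the bytes 0xAA and 0x88 from the question, the first function yields 0xFF88 under the stated assumptions and the second yields 0xAA88.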

And in the context of a function call to a function with variadic arguments:

§6.5.2.2 Function calls

¶6 If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double. These are called the default argument promotions.

¶7 If the expression that denotes the called function has a type that does include a prototype, the arguments are implicitly converted, as if by assignment, to the types of the corresponding parameters, taking the type of each parameter to be the unqualified version of its declared type. The ellipsis notation in a function prototype declarator causes argument type conversion to stop after the last declared parameter. The default argument promotions are performed on trailing arguments.
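The default argument promotions on trailing arguments can be observed from inside a variadic function. In this sketch the helper names are hypothetical; the point is that a char argument has to be fetched with va_arg(ap, int) and a float argument with va_arg(ap, double), because that is the type each one has by the time it arrives:

```c
#include <stdarg.h>

/* Hypothetical helper: returns its first variadic argument, read as int.
   A char or short argument arrives already promoted to int. */
static int first_vararg_as_int(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    int value = va_arg(ap, int);
    va_end(ap);
    return value;
}

/* Hypothetical helper: returns its first variadic argument, read as double.
   A float argument arrives already promoted to double. */
static double first_vararg_as_double(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    double value = va_arg(ap, double);
    va_end(ap);
    return value;
}
```

Passing a signed char holding -86 to the first helper gives back -86 as an int, which is the same value printf() receives in the question.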

Upvotes: 5

wallyk

Reputation: 57774

The value is being sign-extended in the printf call. Apparently your compiler treats plain char as signed, which gives it a range of -128 to 127. The char itself still occupies only 8 bits, not 32.

When a signed char is promoted to an int, its value is sign-extended to the width of int (32 bits in your case). Such conversions are common in C.

Upvotes: 2
