mtraceur
mtraceur

Reputation: 3746

Clarifying How Bitwise Operators, Non-Negative Operants, and Type Conversion Interact

Question

As a fledgling C language-lawyer, I have run into a situation where I am uncertain if I understand what C specifications logically guarantee correctly.

As I understand it, "bitwise operators" (&, |, and &) will work as intuitively expected on non-negative values with any of C's integer types (char/short/int/long/etc, whether signed or unsigned) - regardless of underlying object representation.

Is this correct understanding of what is/isn't strictly well-defined behavior in C?

Key Point

In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result (from the operation itself, not from assigning/interpreting the result into/as an inappropriate type).

Example

Consider the following code:

#include <limits.h>
#define MOST_SIGNIFICANT_BIT = (unsigned char )((UCHAR_MAX >> 1) + 1)

/* ... in some function: */

unsigned char byte;
/* Using broad meaning of "byte", not necessarily "octet" */
int val;

/* val is assigned an arbitrary _non-negative_ value at runtime */

byte = val | MOST_SIGNIFICANT_BIT;

Do note the above comment that val receives a non-negative value at runtime (which can be represented by val's type).

My expectation is that byte has the most significant bit set, and the lower bits are the pure binary representation (no padding bits or trap representations) of the bottom CHAR_BIT - 1 bits of the numerical value of val.

I expect this to remain true even if the type of val is changed to any other integer type, but I expect this guarantee to disappear as soon as the value of val becomes negative (no one result guaranteed for all implementations), or the type of val is changed to any non-integer type (violates a constraint of C's definition of the bitwise operators).

Self-Answering

I am posting my explanation of my current understanding as an answer because I'm fairly confident in it, but I am looking for corrections to any of my misconceptions and will accept any better/correcting answer instead of mine.

Upvotes: 4

Views: 147

Answers (2)

M.M
M.M

Reputation: 141648

In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result

This is covered by C11 section 6.2.6.2 (C99 is similar). There is a footnote that clarifies the intent of the more technical text:

Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types.

The bitwise operators are arithmetic operations as discussed here.

In this footnote, "trap representation" excludes the special case "negative zero". The negative zero may or may not cause UB, but it has its own text (in 6.2.6.2 also) separate from the trap representation text.

So your question can actually be answered for both signed and unsigned values; the only dangerous case is the "negative zero" possibility. (Which can't occur from non-negative input).

Upvotes: 2

mtraceur
mtraceur

Reputation: 3746

Why (I Think) This Is (Probably) Correct

The bitwise operators &, |, and ^ are defined as operating on the actual binary representation of the converted operands: the operands are said to undergo the "usual arithmetic conversions".

As I understand it, it logically follows that when you use two integer type expressions with non-negative values as operands to one of those operators, regardless of padding bits or trap representations, their value bits will "line up": therefore, the result's value bits will have the numerical value which matches with what you'd expect if you just assumed a "pure binary representation".

The kicker is that so long as you start with valid (non-trap), non-negative values as the operands, the operands should always promote to an integer type which can represent both of their values, and thus can logically represent the result value for either of those three operations. You also never run into possible issues with e.g. "signed zero", because limiting yourself to non-negative values avoids such problems. And so long as the result is used as a type which can hold the result value (or as an unsigned integer type), you won't introduce other similar/related issues.

Can These Operations Produce A Trap Representation From Non-Negative Non-Trap Operands?

Footnotes 44/53 and 45/54 of the last C99/C11 drafts, respectively, seem to suggest that the answer to this uncertainty depends on whether foo | bar, foo & bar, and foo ^ bar are considered "arithmetic operations". If they are, then they are not allowed to produce a trap representation result given non-trap values.

The Index of C99 and C11 standard drafts lists the bitwise operators as a subset of "arithmetic operators", suggesting yes. Though C89 doesn't organize its index that way, and my C Programming Language (2nd Edition) has a section called "arithmetic operators" which includes just +, -, *, /, and %, leaving the bitwise operators to a separate section. In other words, there is no clear answer on this point.

In practice, I don't know of any systems where this would happen (given the constraint of non-negative values for both operands), for what that's worth.

And one may consider the following: the type unsigned char is expected (and essentially blessed by C99 and C11) to be capable of accessing all bits of the underlying object representation of a type - it seems likely that the intent is that bitwise operators would work correctly with unsigned char - which would be integer-promoted to int on most modern systems, an unsigned int on the rest: therefore it seems very unlikely that foo | bar, foo & bar, or foo ^ bar would be allowed to produce trap representations - at least if foo and bar are both values that can be held in an unsigned char, and if the result is assigned into an unsigned char.

It is very tempting to generalize from the prior two points that this is a non-issue, although I wouldn't call it a rigorous proof.

Applied to the Example

Here's why I think this is correct and will work as expected:

  1. UCHAR_MAX >> 1 subjects UCHAR_MAX to "usual arithmetic conversions": By definition, UCHAR_MAX will fit into either an int or unsigned int, because on most systems int can represent all values of unsigned char, and on the few that don't, an unsigned int has to be able to represent all values of unsigned char`, so that's just an "integer promotion" in this case.

  2. Because bit shifts are defined in terms of values and not bitwise representations, UCHAR_MAX >> 1 is the quotient of UCHAR_MAX being divided by 2. (Let's call this result UCHAR_MAX_DIV_2).

  3. UCHAR_MAX_DIV_2 + 1 subjects both arguments to usual arithmetic conversion: If UCHAR_MAX fit into an int, then the result is an int, otherwise, it is an unsigned int. Either way the conversions stop at integer promotion.

  4. The result of UCHAR_MAX_DIV_2 + 1 is a positive value, which, when converted into an unsigned char, will have the most significant bit of the unsigned char set, and all other bits cleared (because the conversion would preserve the numerical value, and unsigned char is very strictly defined to have a pure binary representation without any padding bits or trap representations - but even without such an explicit requirement, the resulting value would have the most significant value bit set).

  5. The (unsigned char) cast of MOST_SIGNIFICANT_BIT is actually redundant in this context - cast or no cast, it's going to be subject to the "usual arithmetic conversions" when bitwise-OR'ed. (but it might be useful in other contexts).

  6. The above five steps will be constant-folded on pretty much every compiler out there - but a proper compiler should not constant-fold in a way which differs from the semantics of the code if it hadn't, so all of the above applies.

  7. val | MOST_SIGNIFICANT_BIT is where it gets interesting: unlike << and >>, | and the other binary operators are defined in terms of manipulating the binary representations. Both val and MOST_SIGNIFICANT_BIT are subject to usual arithmetic conversions: details like the layout of bits or trap representations might mean a different binary representation, but should preserve the value: Given two variables of the same integer-type, holding non-negative, non-trap values, the value bits should "line up" correctly, so I actually expect that val | MOST_SIGNIFICANT_BIT produces the correct value (let's call this result VAL_WITH_MSB_SET). I don't see an explicit guarantee that this step couldn't produce a trap representation, but I don't believe there's an implementation of C where it would.

  8. byte = VAL_WITH_MSB_SET forces a conversion: conversion of an integer type (so long as the value is a non-trap value) into a smaller, unsigned integer type is well defined: In this case, the value is reduced modulo UCHAR_MAX + 1. Since val is stated to be positive, the end result is that byte has the value of the remainder of VAL_WITH_MSB_SET divided by UCHAR_MAX + 1.

Explaining Where It Doesn't Work

If val were to be a negative value or non-integer type, we'd be out of luck because there is no longer a logical certainty that the bits that get binary-OR'ed will have the same "meaning":

  1. If val is a signed integer type but has a negative value, then even though MOST_SIGNIFICANT_BIT is promoted to a compatible type, even though the value bits "line up", the result doesn't have any guaranteed meaning (because C makes no guarantee about how negative numbers are encoded), nor is the result (for any encoding) going to have the same meaning, especially once assigned into the unsigned char at the last step.

  2. If val has a non-integer type, it's already violating the C standard, which constrains |, &, and ^ operators to operating on the "integer types". But if your compiler allowed it (or you did some tricks using unions, etc), then you have no guarantees about what each bit means, and thus the bit that you set are meaningless.

Upvotes: 2

Related Questions