Reputation: 530
I wonder how compilers deal with undefined behavior. I will take GCC 10.4 for the x86 architecture with the -O2 -std=c++03 flags as an example, but please feel free to comment on other compilers.
What does it take to alter the outcome of an operation with UB?
The language standard does not prescribe what should happen if an operation has UB, but the compiler will do something. That is, I'm not asking what happens on UB from C++'s perspective but from the compiler's perspective. I know the C++ standard does not impose any restrictions on the behavior of the program.
For example, if I have UB because the value of an object in a memory location is modified more than once by the evaluation of an expression, like so:
int i = 0;
i = ++i + i++; // UB pre-C++11
the chosen compiler in this setup generates assembly code that reduces the computation to a constant, 3 in this case; see https://godbolt.org/z/MEEGT15dM.
What can cause the constant to become anything other than 3 if I do not change the compiler, its version, flags, or architecture? Could editing the function without changing the value of i before the erroneous statement cause it?
Upvotes: -2
Views: 389
Reputation: 223494
The C and C++ language standards define “undefined behavior” to be behavior for which the standard imposes no requirements. Note the qualification “for which the standard imposes no requirements”: it does not mean there are no requirements for the behavior at all, only none from the language standard's perspective. There may be requirements from other specifications that the compiler seeks to conform to, including its own documentation.
Compilers commonly support many things that are “undefined behavior” in the sense of a language standard. A few examples (drawn from GCC because you asked about gcc):
- alignas / _Alignas before they became part of ISO C++ and ISO C
- built-in functions (such as __builtin_popcount and __builtin_add_overflow)
Some compilers define the behavior of certain things the C or C++ standards leave undefined, or have options to define the behavior. For example, MSVC and gcc -fno-strict-aliasing define the behavior of * (uint32_t *) &my_float, but GCC does not by default.
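For concreteness, here is a minimal sketch (my illustration, not from any compiler's documentation) of the type punning in question, next to a memcpy-based alternative whose behavior both standards define:

#include <cstdio>
#include <cstring>
#include <stdint.h>

int main() {
    float my_float = 1.0f;

    // Undefined by the C and C++ standards (a float is read through an
    // incompatible lvalue type); defined under MSVC or gcc -fno-strict-aliasing.
    uint32_t punned = * (uint32_t *) &my_float;

    // Defined alternative: copy the object representation instead.
    uint32_t copied;
    std::memcpy(&copied, &my_float, sizeof copied);

    std::printf("%08x %08x\n", (unsigned) punned, (unsigned) copied);
    return 0;
}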
GCC also has command-line options to define the behavior of signed-integer overflow as two’s complement wrap-around (-fwrapv) or as trapping (-ftrapv), or to define the behavior of some things that are otherwise undefined, such as printing a warning message or printing an error message and aborting (-fsanitize=undefined). (Normally, being undefined behavior does not rule out “happens to work” behavior that makes unit-testing insufficient to verify correctness in other contexts.)
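For instance, here is a minimal sketch (mine, not the GCC manual's) of how those flags pin down an otherwise undefined operation:

#include <climits>
#include <cstdio>

int main() {
    int x = INT_MAX;
    // Undefined by the C++ standard. With -fwrapv, GCC defines the result
    // to wrap to INT_MIN; with -ftrapv or -fsanitize=undefined, the
    // overflow is meant to be caught at run time instead.
    x = x + 1;
    std::printf("%d\n", x);
    return 0;
}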
Anything a compiler supports should be stable; it should not be affected by changing optimization switches, language-variant-selection switches, or other switches except as documented by the compiler. So these “undefined behaviors” should be consistent.
Outside of these, there are things that are neither defined by the applicable language standard nor by the compiler (directly in its own documentation or indirectly through specifications it seeks to conform to). For the most part, you should regard these as not stable. Behaviors that are not at all part of the compiler design may change when optimization switches are changed, when other code is changed, when patterns of memory use or contents of memory are changed, and so on.
Although you generally cannot rely on such behaviors, this does not mean they are without pattern. Compilers are not designed randomly; they have properties that arise out of their design. Experienced programmers may recognize certain symptoms as clues about what is wrong in a program. Even though the behavior is undefined (by the language standard and by the compiler), it nonetheless may fall into a pattern because of how we design software. For example, overrunning a buffer may corrupt data further up (earlier) on the stack. This is not guaranteed to happen; optimization can change what happens when a buffer is overrun, but it is nonetheless a common result.

Furthermore, it is a result some people do rely on. Malicious people may seek to exploit buffer overruns to attack programs and steal information or money, to take control of systems, or to crash or otherwise cause denial of service. The behavior they exploit is not random; it is at least partly predictable, and that is what affords them the opportunity to exploit it. So even fully undefined behavior cannot be regarded as random; good programmers must consider the consequences of undefined behavior and seek to mitigate it.
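To make the buffer-overrun pattern concrete, here is a deliberately broken sketch (my example; nothing about its result is guaranteed):

#include <cstdio>

int main() {
    int sentinel = 42;
    char buf[8];
    // Undefined behavior: the last iteration writes one byte past the end
    // of buf. A common, but unguaranteed, symptom is corruption of a
    // neighboring stack object such as sentinel; optimization can change
    // the stack layout and therefore the observable effect.
    for (int i = 0; i <= 8; ++i)
        buf[i] = 'A';
    std::printf("%d\n", sentinel);
    return 0;
}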
What can cause the constant to become anything rather than 3 if I do not change the compiler, its version, flags or architecture?
For the most part, if you change nothing about a compilation, you should get the same result every time, with a few exceptions. This is because a compiler is a machine; it proceeds mechanically and executes its program mechanically. If the compiler has no bugs, then its behavior should be defined by its source code (even if we, the users, do not know what the definition is), and that means that, given the same input and circumstances, it should produce the same output.
One exception is that compilers might inject date or time information into their output. Similarly, other variations in the execution environment might cause some changes. Another issue is that the output of the compiler is object code, and the object code is not the complete program, so the final program may be influenced by other things. An example is that modern multi-user operating systems commonly use address space layout randomization, so many of the addresses in a program will vary from execution to execution. This is unlikely to affect your i = ++i + i++; example, but it means other bugs resulting in undefined behavior can exhibit some randomness due to the addresses involved.
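As a small sketch of that (assuming a system with ASLR enabled), this program typically prints a different address on each run:

#include <cstdio>

int main() {
    int local = 0;
    // With address space layout randomization, the stack, and therefore
    // &local, typically lands at a different address on each execution.
    std::printf("%p\n", (void *) &local);
    return 0;
}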
Once C is compiled to machine code, that machine code has specific behavior for specific inputs (which may depend on things like the size of environment variables, details of what libraries do, or values in stack memory). Unlike the abstract machine of the C standard, the execution model for machine code on most machines does not include much if any unpredictable behavior, and compilers normally avoid generating machine code with unpredictable behavior, even for C code paths that have compile-time-visible undefined behavior.
Upvotes: 3
Reputation: 81217
The name C is used to describe a variety of language dialects which share some core language features. The Standard was chartered to describe the common core features of those dialects in a manner that was agnostic to features that, while common, were not universal. While the Standard does not require that implementations behave in a manner consistent with a "high-level assembler", the authors of the Standard have expressly said (citation below) that they did not wish to preclude the use of the language for that purpose.
If an implementation is designed to be suitable for low-level programming on a particular platform, it will process many constructs "in a documented manner characteristic of the environment" in cases where doing so would be useful for its customers, without regard for whether or not the Standard would require that it do so.
What gets tricky is that when optimization is enabled, some compilers are designed to identify circumstances where the Standard would not impose any requirements on the behavior of a certain function unless certain inputs are received, and then to replace the parts of the source code that would check whether such inputs are received with machine code that blindly assumes they will be.
Such replacement will be useful if all of the inputs the functions receive are consistent with such assumptions, but disastrous if the functions receive inputs which would have yielded acceptable--or even useful--behavior if processed "in a documented manner characteristic of the environment", but whose behavior isn't mandated by the Standard.
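A classic sketch of this (my illustration): because signed overflow is undefined, gcc -O2 will typically fold this after-the-fact overflow check to a constant, effectively assuming the input never overflows:

// With gcc -O2, x + 1 is assumed not to wrap (signed overflow is UB by
// the Standard), so this function is commonly compiled to return 0
// unconditionally, and any caller's "overflow detected" path becomes
// dead code.
int next_would_overflow(int x) {
    return x + 1 < x;
}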
Things get even trickier if one factors in the fact that implementations which process integer arithmetic in a manner that might not always yield predictable values, but that could never have any side effects beyond yielding possibly-meaningless values, rarely document the latter guarantee if their authors can't imagine compilers for their target platform failing to uphold it. Unfortunately, the Standard provides no means of distinguishing implementations which uphold the latter guarantee from those that don't, and thus no means of letting programmers invite useful optimizations that might cause a program that would have behaved in one way to behave in an observably different, but equally acceptable, way.
Anyone wanting to understand Undefined Behavior should do two things:
Read the published Rationale document for the C Standard, located at https://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf (most notably page 11, starting at line 23, but also page 2 lines 32-36; page 13 lines 5-8; page 44 starting on line 20; and page 60 lines 17-19).
Recognize that while the Rationale document describes the language the Committee was chartered to describe, some compiler maintainers aggressively regard situations where the Standard fails to mandate that compilers process code correctly, or in a manner consistent with what the authors of the Standard expected of "most current implementations", as implying a judgment that no possible way of handling such situations would be any worse than any other.
Upvotes: 1