hally9k

Reputation: 2583

The inner workings of the g++ compiler, for dummies

I have a very basic understanding of the stages of compilation when compiling C/C++ code with g++, but I want confirmation, clarification, and additional pearls of wisdom, please.

For this set of files:

main.c
foo.h
foo.c
bar.h
bar.c

These calls do the following...

g++ -c foo.c
g++ -c bar.c
g++ -c main.c

The header files are now included into the source files, and each of these .c files is compiled into a .o object file.

g++ -o main.out main.o foo.o bar.o

Now all the .o files are linked together into a single executable, main.out.
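
As an aside, I believe the whole thing can also be done in a single call, with g++ compiling each .c file to a temporary object file and then linking them straight away:

g++ -o main.out main.c foo.c bar.c

So -c just means "stop after producing the .o file, don't link".
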

Upvotes: 0

Views: 330

Answers (1)

Cameron

Reputation: 98746

The .c files are compiled into object files, which are then linked into a final binary. Object files are basically unfinalized pieces of the binary (they contain the compiled machine code for the functions, etc. defined in the .c files).

The .c files, during compilation, include header files, which are essentially just expanded in-place where the #include directive is. In that sense, a .c file stands alone, and there is no need for the headers to be compiled separately; they are all part of a single translation unit which gets turned into a single object file.
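
A minimal sketch, using hypothetical contents for foo.h and foo.c (the originals aren't shown in the question), of what ends up in the translation unit for foo.c:

    /* foo.h -- hypothetical contents: a declaration only */
    #ifndef FOO_H
    #define FOO_H
    int foo(int x);
    #endif

    /* foo.c -- hypothetical contents: includes the header, provides the definition */
    #include "foo.h"
    int foo(int x) { return x + 1; }

After preprocessing, the translation unit for foo.c is simply the text of foo.h spliced in where the #include was, followed by the rest of foo.c; that single unit is what gets compiled into foo.o.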

The first step in compilation is for the preprocessor to run; this is a fancy text manipulator that handles all the lines starting with # (so, it does the expansion of #include directives and conditional #ifdefs, etc.).
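
You can look at this step in isolation; passing -E makes g++ stop after preprocessing and print the resulting text (shown here on the hypothetical foo.c from above):

g++ -E foo.c

The output is still plain C/C++ source: the header text pasted in place of the #include, macros expanded, and any #ifdef branches whose condition failed removed.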

Then, the text of a translation unit is tokenized (this is called lexical analysis): the bytes are turned into the simplest possible recognizable tokens, for example '.' becomes a DOT token, '++' becomes a single INCREMENT token, keywords are recognized, and variable names are parsed as entire entities (identifiers). The tokens still have no meaning, but they're easier to work with than a byte stream.
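
As a toy illustration only (g++'s real lexer is far more involved, and the token names here are invented), a tokenizer is conceptually something like this:

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::string src = "int a = 3; a++;";
        std::vector<std::string> tokens;
        for (std::size_t i = 0; i < src.size(); ) {
            unsigned char c = static_cast<unsigned char>(src[i]);
            if (std::isspace(c)) { ++i; continue; }
            if (std::isalpha(c)) {                      // keyword or identifier
                std::string word;
                while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i])))
                    word += src[i++];
                tokens.push_back(word == "int" ? "KEYWORD(int)" : "IDENT(" + word + ")");
            } else if (std::isdigit(c)) {               // integer constant
                std::string num;
                while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                    num += src[i++];
                tokens.push_back("CONST(" + num + ")");
            } else if (src.compare(i, 2, "++") == 0) {  // '++' becomes one INCREMENT token, not two '+'
                tokens.push_back("INCREMENT");
                i += 2;
            } else if (src[i] == '=') { tokens.push_back("ASSIGN");    ++i; }
            else if (src[i] == ';')   { tokens.push_back("SEMICOLON"); ++i; }
            else                      { tokens.push_back(std::string("OTHER(") + src[i] + ")"); ++i; }
        }
        for (const std::string& t : tokens) std::cout << t << '\n';
    }

Run on "int a = 3; a++;" this prints KEYWORD(int), IDENT(a), ASSIGN, CONST(3), SEMICOLON, IDENT(a), INCREMENT, SEMICOLON, one per line: a flat stream of labelled tokens with no structure yet.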

The next logical step, called syntactic analysis, turns the stream of tokens into abstract structures based on the grammar (syntax) of the language. This is where syntax errors are reported. For example, int a = 3; might be parsed as declaration(sym(a), expression(constint(3))).
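
A rough sketch of the kind of structure a parser could build for int a = 3; (the node types below are invented for illustration and bear no relation to g++'s internal ones):

    #include <iostream>
    #include <string>

    // Invented AST node types, just to make the shape of the parse concrete.
    struct ConstInt    { int value; };
    struct Expression  { ConstInt constant; };
    struct Declaration {
        std::string type;        // "int"
        std::string name;        // "a"
        Expression  initializer; // expression(constint(3))
    };

    int main() {
        // The parser's output for "int a = 3;" is a tree of typed nodes, not text.
        Declaration decl{"int", "a", Expression{ConstInt{3}}};
        std::cout << "declaration(sym(" << decl.name << "), expression(constint("
                  << decl.initializer.constant.value << ")))\n";
    }

The point is that the result is a tree of typed nodes rather than text, and that tree is what the later stages work on.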

The next logical step after that is semantic analysis, which gives meaning to the syntactic structures -- for example, the parser will happily accept twenty declarations of a variable with the same name in the same scope, but semantically this makes no sense. More errors are reported here, e.g. non-void functions that don't return a value on all control paths.
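
The duplicate-declaration check, for instance, boils down to bookkeeping with symbol tables; a minimal sketch of the idea (again invented, nothing like the real implementation):

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // One scope's symbol table: name -> type.
        std::map<std::string, std::string> scope;

        auto declare = [&](const std::string& name, const std::string& type) {
            if (!scope.emplace(name, type).second)   // name already declared in this scope
                std::cout << "error: redeclaration of '" << name << "'\n";
            else
                std::cout << "declared " << type << " " << name << "\n";
        };

        declare("a", "int");   // fine
        declare("a", "int");   // parses fine as a declaration, but is rejected here
    }

The second declare("a", "int") is perfectly valid syntax, but the symbol table already has an entry for a in that scope, so it is reported as an error.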

Finally, there is a code generation step, which selects low-level CPU instructions to execute the semantic structures of the translation unit. This is actually a huge "step", and may include further transformations on the semantic data structures (usually in the form of an abstract syntax tree or AST) into lower-level (intermediate) representations before the final instruction code is generated.
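
You can inspect the result of this step too: -S tells g++ to stop after code generation and write the assembly for the translation unit to foo.s instead of assembling it into foo.o:

g++ -S foo.c

Opening foo.s shows the actual instructions that were selected for each function.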

In practice some of these passes are combined (e.g. tokenization usually happens on-demand during the syntactic analysis phase, which may also be constructing semantically meaningful symbol tables, etc.). There are also various optimizations (some integrated, some in separate passes) sprinkled throughout. I believe GCC, for example, transforms the program into an SSA intermediate representation to do data-flow analysis for better code optimization.
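
If you want to see those intermediate representations, GCC can be asked to dump them; for example, the -fdump-tree-ssa option writes a dump of the SSA form to a file alongside the object file (the exact dump file names vary between GCC versions):

g++ -O2 -fdump-tree-ssa -c foo.c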

The generated instructions, global and static variables, and so on, are then dumped into an object file. The object files are then linked together into an executable (the addresses of global variables and of functions defined in other, external object files and in dynamic/shared libraries are resolved at this time and fixed up in the final code).
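
You can watch this resolution happen with the nm tool, which lists the symbols in an object file or executable:

nm main.o
nm main.out

In main.o, the functions that live in foo.o and bar.o show up as undefined (marked with a U); in the linked main.out they have been given real addresses.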

None of this is specific to gcc; this applies to most (all?) C++ (and C) compilers.

Upvotes: 9
