Reputation: 31
To narrow down my question, let me describe my assumption and the experiment that I did...
My assumption: Code written in assembly language will run much faster than its C/C++ counterpart, and the executable will be much smaller than one generated from C/C++ code.
The experiment: I wrote the program below into bin2dec.c:
#include <stdio.h>
int main()
{
    long int binary, decimal, remainder, exp;
    int i, j;
    for(i=0; i<10000; i++)
    {
        for(j=0; j<1000; j++)
        {
            binary = 11000101;
            exp = 1;
            decimal = 0;
            while(binary != 0)
            {
                remainder = binary % 10;
                binary = binary / 10;
                decimal = decimal + remainder * exp;
                exp *= 2;
            }
        }
    }
    return 0;
}
Then I generated the ASM code for it with gcc -S bin2dec.c -o bin2dec.s
After that, I compiled both files as shown below:
gcc bin2dec.c -o bin2dec_c
gcc bin2dec.s -o bin2dec_s
Test 1: Looked at some internal details of both files
[guest@localhost ASM]$ size bin2dec_c bin2dec_s
text data bss dec hex filename
951 252 4 1207 4b7 bin2dec_c
951 252 4 1207 4b7 bin2dec_s
Result: Both are exactly the same...
Test 2: Executed the files and measured the time taken
[guest@localhost ASM]$ time ./bin2dec_c
real 0m1.724s
user 0m1.675s
sys 0m0.002s
[guest@localhost ASM]$ time ./bin2dec_s
real 0m1.721s
user 0m1.676s
sys 0m0.001s
Result: Both are the same. Sometimes the executable generated from ASM even ran slower :-(
So the question is: were my assumptions wrong? If not, what mistake did I make that caused both executables bin2dec_c and bin2dec_s to run at the same speed? Is there a better way to get ASM code from a C/C++ program, or should I rewrite all the logic from scratch in ASM to gain the advantages of speed and program size?
Upvotes: 1
Views: 1981
Reputation: 1
Generating an assembler file from the compiler is an old tradition (in the 1970s, on early Unix systems, machines were so small that it was simpler to generate an assembler file), and some compilers can generate object files or machine code directly: probably some recent versions of Clang/LLVM, TinyCC (for C only: fast compilation time, but very slow executables!), perhaps some proprietary XLC compiler from IBM; and some people in the GCC community are thinking about it (notably for GCCJIT).
However, generating an assembler file is often easier for compiler developers. And since most of the compiler's work happens in the optimization passes (which transform internal representations inside the compiler), losing a few milliseconds to start the assembler is not very important.
With GCC, compile with gcc -time and gcc -ftime-report (and of course your usual optimization flags, e.g. -O2) to understand where the compiler spends its time. It is never in the assembler...
You might sometimes find it useful to look into the generated assembler file. Compile your foo.cc C++11 file with g++ -O2 -Wall -S -fverbose-asm -std=c++11 foo.cc, then look (with some editor or pager) into the generated foo.s assembler file.
You could even compile with g++ -fdump-tree-all -O2 and get hundreds of compiler dump files from GCC explaining what transformations the compiler did on your code.
BTW, today's (superscalar, pipelined) processors (the ones in your desktop, your laptop, your tablet, your server) are so complex that in practice a compiler can optimize better than a human programmer. So, practically speaking, the assembler code produced by an optimizing compiler from realistically sized C code (e.g. a C source file of a few hundred lines) is often faster than what an experienced human assembly programmer can write in a few weeks (less than a thousand lines of assembler). In other words, your assumption (that human-written assembler code is faster/better than human-written C compiled by a good optimizing compiler) is wrong in practice.
(BTW, an optimizing compiler is permitted to transform your bin2dec.c program, which has no observable side effects, e.g. no input and no output, into an empty program, and GCC 5.2 does that with gcc -O2!!)
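If you want the timing to keep measuring the conversion work under -O2, one option (a sketch, not your original program: the volatile input and the printed sink are additions of mine) is to make the input unknown to the optimizer and the result observable:
#include <stdio.h>
int main(void)
{
    /* Sketch: read the starting value through a volatile so the optimizer cannot
       treat it as a compile-time constant, and print an accumulated result so the
       work is an observable side effect that -O2 is not allowed to delete. */
    volatile long int input = 11000101;
    long int binary, decimal, remainder, exp;
    long int sink = 0;
    int i, j;
    for(i=0; i<10000; i++)
    {
        for(j=0; j<1000; j++)
        {
            binary = input;      /* forces a real load every iteration */
            exp = 1;
            decimal = 0;
            while(binary != 0)
            {
                remainder = binary % 10;
                binary = binary / 10;
                decimal = decimal + remainder * exp;
                exp *= 2;
            }
            sink += decimal;
        }
    }
    printf("%ld\n", sink);       /* observable output */
    return 0;
}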
Also read about the halting problem and Rice's theorem. There is an intrinsic limitation to what optimizing compilers or static program analyzers can achieve.
Upvotes: 5
Reputation: 365237
Assumption: Code written in assembly language will run much faster than its C/C++ counterpart, and the executable will be much smaller than one generated from C/C++ code.
Assembly language is just a text representation for machine code.
With a few caveats, you can disassemble a binary and re-assemble that source back into the same binary. Apparently this is truly possible for ARM, but x86 asm dialects don't have syntax to represent every different encoding of the same instruction, e.g. forcing use of a 4-byte offset in a jmp instruction in the PLT (Procedure Linkage Table), where the jump targets will be patched later.
Your experiment made two identical binaries: when gcc goes directly from C to an executable, it internally makes an asm source file and assembles it. You just split the process up so you could get your hands on the compiler-generated asm.
Hand-written assembly code can always be at least as good as compiler output, because you can start from the compiler output and look for improvements. In rare cases, there won't be any improvements to find.
Simply observing the compiler-generated asm during the compile process doesn't do anything to improve it, though! Plug your code into http://gcc.godbolt.org/ to see the output from various different compilers (or even for ARM or PPC, which is interesting for std::atomic code, to see what happens on a weakly-ordered arch).
Since you compiled with no optimizations, there are certainly huge improvements to be made. I'd start with the output of gcc -O3 -march=native -fverbose-asm -masm=intel -S.
It's very rare for compiler output to be truly optimal, though, even for short sequences. Where compilers have the advantage over humans is in keeping track of a whole lot of source code at once, and making optimizations based on things they can prove across functions. (Such whole-program optimizations would be too brittle for humans to maintain in source code.) So compilers can take advantage of things that happen to be true in this particular build, but aren't part of the design of the functions being compiled.
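As a small, hypothetical illustration (the functions scale and total below are invented for this example, not taken from any real code): when the compiler can see a helper together with its only caller, it is free to inline the call, propagate the constant argument, and delete the untaken path. That shortcut is a fact about this particular build, and it would be fragile to maintain by hand in assembly:
/* scale() has a general contract, but in this translation unit its only
   caller passes factor == 1.  An optimizing compiler can inline scale()
   into total(), constant-propagate factor, and drop the multiply and the
   branch entirely. */
static long scale(long x, long factor)
{
    if(factor == 1)
        return x;
    return x * factor;
}

long total(const long *values, int n)
{
    long sum = 0;
    for(int i = 0; i < n; i++)
        sum += scale(values[i], 1);   /* constant argument visible to the compiler */
    return sum;
}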
Compilers almost always do a good job, but extremely rarely a great job. What matters is that it's a good enough job and the code runs fast, even if it uses more instructions than needed. Usually things like branch mispredicts, cache misses, and dependency chains are the bottlenecks, and CPUs are wide enough to handle the extra instructions compilers tend to use without significant slowdown. (With hyperthreading, where two threads share a core's resources, doing the same work with fewer instructions is a bigger advantage.)
For a concrete example, see the compiler output on https://codereview.stackexchange.com/questions/6502/fastest-way-to-clamp-an-integer-to-the-range-0-255 and compare it with my probably-optimal hand-written asm. I tried to get gcc to generate similarly optimal output, but without success: it either used multiple branches or two cmov instructions (which would make the no-clamping fast path slow), rather than one branch for whether clamping is needed at all, then a single cmov to pick clamp-to-zero or clamp-to-max.
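For reference, the operation that question is about is just clamping an int into the 0..255 range. Here is a sketch of one common C formulation (the name clamp_u8 is mine, not from the linked thread), where the in-range check maps naturally onto one branch and the ternary onto a single cmov:
/* Clamp a signed int into 0..255.  The common case (x already in range)
   takes the early return; only out-of-range values reach the ternary. */
static inline int clamp_u8(int x)
{
    if((unsigned)x <= 255u)    /* one unsigned compare covers 0 <= x <= 255 */
        return x;
    return (x < 0) ? 0 : 255;
}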
Upvotes: 3