Reputation: 234554
So, I had this code:
constexpr unsigned N = 1000;
void f1(char* sum, char* a, char* b) {
for(int i = 0; i < N; ++i) {
sum[i] = a[i] + b[i];
}
}
void f2(char* sum, char* a, char* b) {
char* end = sum + N;
while(sum != end) {
*sum++ = *a++ + *b++;
}
}
I wanted to see the code that GCC 4.7.2 would generate. So I ran g++ -march=native -O3 -masm=intel -S a.c++ -std=c++11
And got the following output:
.file "a.c++"
.intel_syntax noprefix
.text
.p2align 4,,15
.globl _Z2f1PcS_S_
.type _Z2f1PcS_S_, @function
_Z2f1PcS_S_:
.LFB0:
.cfi_startproc
lea rcx, [rdx+16]
lea rax, [rdi+16]
cmp rdi, rcx
setae r8b
cmp rdx, rax
setae cl
or cl, r8b
je .L5
lea rcx, [rsi+16]
cmp rdi, rcx
setae cl
cmp rsi, rax
setae al
or cl, al
je .L5
xor eax, eax
.p2align 4,,10
.p2align 3
.L3:
movdqu xmm0, XMMWORD PTR [rdx+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
paddb xmm0, xmm1
movdqu XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, 992
jne .L3
mov ax, 8
mov r9d, 992
.L2:
sub eax, 1
lea rcx, [rdx+r9]
add rdi, r9
lea r8, [rax+1]
add rsi, r9
xor eax, eax
.p2align 4,,10
.p2align 3
.L4:
movzx edx, BYTE PTR [rcx+rax]
add dl, BYTE PTR [rsi+rax]
mov BYTE PTR [rdi+rax], dl
add rax, 1
cmp rax, r8
jne .L4
rep
ret
.L5:
mov eax, 1000
xor r9d, r9d
jmp .L2
.cfi_endproc
.LFE0:
.size _Z2f1PcS_S_, .-_Z2f1PcS_S_
.p2align 4,,15
.globl _Z2f2PcS_S_
.type _Z2f2PcS_S_, @function
_Z2f2PcS_S_:
.LFB1:
.cfi_startproc
lea rcx, [rdx+16]
lea rax, [rdi+16]
cmp rdi, rcx
setae r8b
cmp rdx, rax
setae cl
or cl, r8b
je .L19
lea rcx, [rsi+16]
cmp rdi, rcx
setae cl
cmp rsi, rax
setae al
or cl, al
je .L19
xor eax, eax
.p2align 4,,10
.p2align 3
.L17:
movdqu xmm0, XMMWORD PTR [rdx+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
paddb xmm0, xmm1
movdqu XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, 992
jne .L17
add rdi, 992
add rsi, 992
add rdx, 992
mov r8d, 8
.L16:
xor eax, eax
.p2align 4,,10
.p2align 3
.L18:
movzx ecx, BYTE PTR [rdx+rax]
add cl, BYTE PTR [rsi+rax]
mov BYTE PTR [rdi+rax], cl
add rax, 1
cmp rax, r8
jne .L18
rep
ret
.L19:
mov r8d, 1000
jmp .L16
.cfi_endproc
.LFE1:
.size _Z2f2PcS_S_, .-_Z2f2PcS_S_
.ident "GCC: (GNU) 4.7.2"
.section .note.GNU-stack,"",@progbits
I suck at reading assembly, so I decided to add some markers to know where the bodies of the loops went:
constexpr unsigned N = 1000;
void f1(char* sum, char* a, char* b) {
for(int i = 0; i < N; ++i) {
asm("# im in ur loop");
sum[i] = a[i] + b[i];
}
}
void f2(char* sum, char* a, char* b) {
char* end = sum + N;
while(sum != end) {
asm("# im in ur loop");
*sum++ = *a++ + *b++;
}
}
And GCC spat this out:
.file "a.c++"
.intel_syntax noprefix
.text
.p2align 4,,15
.globl _Z2f1PcS_S_
.type _Z2f1PcS_S_, @function
_Z2f1PcS_S_:
.LFB0:
.cfi_startproc
xor eax, eax
.p2align 4,,10
.p2align 3
.L2:
#APP
# 4 "a.c++" 1
# im in ur loop
# 0 "" 2
#NO_APP
movzx ecx, BYTE PTR [rdx+rax]
add cl, BYTE PTR [rsi+rax]
mov BYTE PTR [rdi+rax], cl
add rax, 1
cmp rax, 1000
jne .L2
rep
ret
.cfi_endproc
.LFE0:
.size _Z2f1PcS_S_, .-_Z2f1PcS_S_
.p2align 4,,15
.globl _Z2f2PcS_S_
.type _Z2f2PcS_S_, @function
_Z2f2PcS_S_:
.LFB1:
.cfi_startproc
xor eax, eax
.p2align 4,,10
.p2align 3
.L6:
#APP
# 12 "a.c++" 1
# im in ur loop
# 0 "" 2
#NO_APP
movzx ecx, BYTE PTR [rdx+rax]
add cl, BYTE PTR [rsi+rax]
mov BYTE PTR [rdi+rax], cl
add rax, 1
cmp rax, 1000
jne .L6
rep
ret
.cfi_endproc
.LFE1:
.size _Z2f2PcS_S_, .-_Z2f2PcS_S_
.ident "GCC: (GNU) 4.7.2"
.section .note.GNU-stack,"",@progbits
This is considerably shorter, and has some significant differences like the lack of SIMD instructions. I was expecting the same output, with some comments somewhere in the middle of it. Am I making some wrong assumption here? Is GCC's optimizer hindered by asm comments?
Upvotes: 83
Views: 9152
Reputation: 8268
This answer has been modified: it was originally written on the assumption that inline Basic Asm is a fairly strongly specified tool, but in GCC it is nothing of the sort. Basic Asm is weakly specified, so the answer was edited.
Each assembly comment acts as a breakpoint.
EDIT: But a broken one, since you use Basic Asm. Inline asm (an asm statement inside a function body) without an explicit clobber list is a weakly specified feature in GCC and its behavior is hard to define. It doesn't seem (I don't fully grasp its guarantees) to be attached to anything in particular, so while the assembly code must run at some point if the function runs, it isn't clear when it runs at any non-trivial optimization level. A breakpoint that can be reordered with neighboring instructions isn't a very useful "breakpoint". END EDIT
You could run your program in an interpreter that breaks at each comment and prints out the state of every variable (using debug information). These points must exist so that you observe the environment (state of registers and memory).
Without the comment, no observation point exists, and the loop is compiled as a single mathematical function taking an environment and producing a modified environment.
You want to know the answer to a meaningless question: you want to know how each instruction (or maybe block, or maybe range of instructions) is compiled, but no single isolated instruction (or block) is compiled on its own; everything is compiled as a whole.
A better question would be:
Hello GCC. Why do you believe this asm output is implementing the source code? Please explain step by step, with every assumption.
But then you wouldn't want to read a proof longer than the asm output, written in terms of GCC's internal representation.
Upvotes: -2
Reputation: 129454
I don't agree with the "gcc doesn't understand what is in the asm() block". For example, gcc can deal quite well with optimising parameters, and even with re-arranging asm() blocks so that they intermingle with the generated C code. This is why, if you look at inline assembler in, for example, the Linux kernel, it is nearly always marked __volatile__ to ensure that the compiler "doesn't move the code around". I have had gcc move my "rdtsc" around, which threw off my measurements of the time it took to do certain things.
As documented, gcc treats certain types of asm() blocks as "special", and thus doesn't optimise the code on either side of the block.
That's not to say that gcc won't, sometimes, get confused by inline assembler blocks, or simply decide to give up on some particular optimisation because it can't follow the consequences of the assembler code, etc, etc. More importantly, it can often get confused by missing clobber tags - so if you have some instruction like cpuid that changes the value of EAX-EDX, but you wrote the code so that it only uses EAX, the compiler may store things in EBX, ECX and EDX, and then your code acts very strange when these registers are overwritten... If you are lucky, it crashes immediately - then it's easy to figure out what is going on. But if you are unlucky, it crashes way down the line... Another tricky one is the divide instruction, which gives a second result in EDX. If you don't care about the modulo, it's easy to forget that EDX was changed.
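To make the clobber point concrete, here is a minimal sketch of a cpuid wrapper that tells the compiler about all four registers the instruction writes. It assumes GCC or Clang extended asm on an x86-64 target; the names CpuidResult and cpuid_query are made up for illustration and are not from any library.

```cpp
#include <cstdint>

// Hypothetical helper type: holds the four registers that cpuid writes.
struct CpuidResult {
    std::uint32_t eax, ebx, ecx, edx;
};

// Hypothetical wrapper (assumes x86-64 with GCC or Clang).
// All four registers are declared as outputs, so the compiler knows
// it cannot keep live values in EBX, ECX or EDX across the instruction.
CpuidResult cpuid_query(std::uint32_t leaf, std::uint32_t subleaf = 0) {
    CpuidResult r;
    asm volatile("cpuid"
                 : "=a"(r.eax), "=b"(r.ebx), "=c"(r.ecx), "=d"(r.edx)
                 : "a"(leaf), "c"(subleaf));
    return r;
}
```

If you only listed "=a" here, the code might still appear to work until the optimiser happens to keep something live in one of the other three registers.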
Upvotes: 3
Reputation: 47018
The interactions with optimisations are explained about halfway down the "Assembler Instructions with C Expression Operands" page in the documentation.
GCC doesn't try to understand any of the actual assembly inside the asm; the only thing it knows about the content is what you (optionally) tell it in the output and input operand specification and the register clobber list.
In particular, note:
An asm instruction without any output operands will be treated identically to a volatile asm instruction.
and
The volatile keyword indicates that the instruction has important side-effects [...]
So the presence of the asm inside your loop has inhibited a vectorisation optimisation, because GCC assumes it has side effects.
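A minimal sketch of the two cases side by side, in case you want to reproduce this yourself. It assumes GCC (or Clang) targeting x86, where # starts an assembler comment; the function names add_plain, add_marked and same_result are made up here. Both functions compute the same thing; only the generated code should differ when you compile with -O3 -S and compare.

```cpp
#include <cstddef>

constexpr std::size_t N = 1000;

// No asm statement in the body: the optimiser is free to vectorise this loop.
void add_plain(char* sum, const char* a, const char* b) {
    for (std::size_t i = 0; i < N; ++i) {
        sum[i] = a[i] + b[i];
    }
}

// Same loop with an assembly comment as a marker. The asm has no output
// operands, so it is treated as volatile and assumed to have side effects.
void add_marked(char* sum, const char* a, const char* b) {
    for (std::size_t i = 0; i < N; ++i) {
        asm("# marker: loop body");
        sum[i] = a[i] + b[i];
    }
}

// Hypothetical check that the marker changes code generation only,
// not the values that end up in the output array.
bool same_result() {
    char a[N], b[N], s1[N], s2[N];
    for (std::size_t i = 0; i < N; ++i) {
        a[i] = static_cast<char>(i % 50);
        b[i] = static_cast<char>(i % 70);
    }
    add_plain(s1, a, b);
    add_marked(s2, a, b);
    for (std::size_t i = 0; i < N; ++i) {
        if (s1[i] != s2[i]) return false;
    }
    return true;
}
```

The observable behaviour is identical, which is exactly why the difference in the assembly output is surprising at first sight.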
Upvotes: 66
Reputation: 58772
Note that gcc vectorized the code, splitting the loop body into two parts: the first processes 16 items at a time, and the second handles the remainder afterwards.
As Ira commented, the compiler doesn't parse the asm block, so it does not know that it's just a comment. Even if it did, it has no way of knowing what you intended. The optimized code has the loop body duplicated: should it put your asm in each copy? Would you be happy if it isn't executed 1000 times? It doesn't know, so it goes the safe route and falls back to the simple single loop.
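Written back out as C++, the vectorized shape looks roughly like the sketch below. This is only an illustration of the loop structure, not what GCC does internally: the inner lane loop stands in for a single paddb, the overlap checks from the real output are left out, and the names add_split, TripCounts and kLanes are made up.

```cpp
#include <cstddef>

constexpr std::size_t N = 1000;     // same trip count as in the question
constexpr std::size_t kLanes = 16;  // bytes in one XMM register

// Hypothetical helper type: how often each of the two loops ran.
struct TripCounts {
    std::size_t vector_iters;
    std::size_t scalar_iters;
};

TripCounts add_split(char* sum, const char* a, const char* b) {
    constexpr std::size_t kMain = N - N % kLanes;  // 992, as in the asm output
    TripCounts counts{0, 0};
    std::size_t i = 0;
    // Main part: one iteration handles 16 elements (one paddb in the real code).
    for (; i < kMain; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            sum[i + lane] = a[i + lane] + b[i + lane];
        }
        ++counts.vector_iters;
    }
    // Remainder: the last N % 16 elements, one at a time.
    for (; i < N; ++i) {
        sum[i] = a[i] + b[i];
        ++counts.scalar_iters;
    }
    return counts;
}
```

The two loops together run 62 + 8 = 70 times, not 1000, so there is no obviously right place or count for a statement that the source says must execute once per element.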
Upvotes: 23