Reputation: 275
I do not know any real assembly, but I can read GCC -S output to evaluate the actual costs of given C code.
This question is not so much about profiling and benchmarks as it is educational. I need someone to explain to me why the first snippet is not faster than the second one.
Well, I used to think along the lines of: "yeah, some operations like MUL are pretty expensive, but if one assembly listing is X times bigger than another, it should be slower". That held true until I came across these two:
unsigned char bytes[4] = {0, 0, 0, 5};
// 1
int32_t val = *((int32_t*)bytes);
/* produces:
leaq -16(%rbp), %rax
movl (%rax), %eax
movl %eax, -4(%rbp)
movl $0, %eax
*/
// 2
val = bytes[3] |
(bytes[2] << 8) |
(bytes[1] << 16) |
(bytes[0] << 24);
/* produces:
movzbl -13(%rbp), %eax
movzbl %al, %eax
movzbl -14(%rbp), %edx
movzbl %dl, %edx
sall $8, %edx
orl %eax, %edx
movzbl -15(%rbp), %eax
movzbl %al, %eax
sall $16, %eax
orl %eax, %edx
movzbl -16(%rbp), %eax
movzbl %al, %eax
sall $24, %eax
orl %edx, %eax
movl %eax, -4(%rbp)
movl $0, %eax
*/
And benchmarks show that the second one is 5-10% faster. What is going on here?
The only significant difference and "reason" I can imagine is that LEAQ is something very slow. The last 2 lines are identical, so maybe the price of MOV is so high that 1 extra MOV is worse than tons of instructions.
Here is what I used to measure execution times:
#include <stdio.h>
#include <time.h>
#include <stdint.h>

#define REPETITIONS 32
#define ITERATIONS 90000

#define CODE1 \
    for (int i = 0; i < ITERATIONS; ++i) { \
        val = *((int32_t*)bytes); \
    }

#define CODE2 \
    for (int i = 0; i < ITERATIONS; ++i) { \
        val = bytes[3] | \
              (bytes[2] << 8) | \
              (bytes[1] << 16) | \
              (bytes[0] << 24); \
    }

int main(void) {
    clock_t minTs = 999999999, ts;
    unsigned char bytes[4] = {0, 0, 0, 5};
    int32_t val;

    for (int i = 0; i < REPETITIONS; ++i) {
        ts = clock();
        CODE1; // Or CODE2
        ts = clock() - ts;
        if (ts < minTs) minTs = ts;
    }

    printf("ts: %ld\n", minTs);
    return val;
}
Upvotes: 1
Views: 703
Reputation: 366016
Update: extra loads/stores can reduce store-forwarding latency on Sandybridge-family, so the extra work in CODE2 probably sped up the loop-carried dependency chain that both loops bottleneck on (because -O0 keeps the loop counter in memory).
This effect is present in all SnB-family CPUs, including the Sandybridge I had when writing this answer, the OP's Haswell, the Broadwell in "Adding a redundant assignment speeds up code when compiled without optimization", and the Skylake in "Loop with function call faster than an empty loop".
If you're curious about how your code compiles to asm, put your code in a function where it can't get optimized away. On this godbolt link, you can see how gcc 5.1 and newer (at -O3 -march=native -mtune=native) see through the ORing together of bytes and use movbe (move big-endian) to bswap on the fly while loading. icc, clang, and older gcc emit instructions that load individual bytes and shift / OR them into place.
I was disappointed that compilers didn't see through the ORing of bytes even when I reversed the order to do a little-endian load instead of a big-endian load. (see the native, big, and little endian functions on godbolt.) Even changing the types to uint32_t and uint8_t didn't help for compilers other than gcc >= 5.1.
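For reference, here is a sketch of what the little-endian variant of that idiom looks like (an assumed shape; the exact native/big/little functions are on the godbolt link):

#include <stdint.h>

/* Little-endian: byte 0 is the low byte. Per the godbolt results described
   above, gcc 5.1+ recognizes this pattern and can turn it into a single
   32-bit load (and the big-endian version into movbe / load + bswap); the
   other compilers at the time loaded and OR'ed the bytes individually. */
uint32_t load_le(const uint8_t *p) {
    return  (uint32_t)p[0]
          | ((uint32_t)p[1] << 8)
          | ((uint32_t)p[2] << 16)
          | ((uint32_t)p[3] << 24);
}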
Obviously with optimization on, compilers throw away the loops that just set an unused variable. They just call clock twice, printf, and then mov-immediate the answer into eax. If you want to actually benchmark something, compile it separately from the function that will call it with constant data. That way, you can have simple code that takes its work as function arguments, and it can't get inlined into the caller that passes it constant data.
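A minimal sketch of that setup, with hypothetical file and function names (not from the question):

/* combine.c -- compile separately (e.g. gcc -O3 -c combine.c) so the
   caller's constant data can't be propagated into the function */
#include <stdint.h>

int32_t combine_bytes(const unsigned char bytes[4]) {
    return (int32_t)( (uint32_t)bytes[3]
                    | ((uint32_t)bytes[2] << 8)
                    | ((uint32_t)bytes[1] << 16)
                    | ((uint32_t)bytes[0] << 24) );
}

/* bench.c -- the timing loop only sees the prototype, so the work can't be
   inlined, constant-folded, or optimized away */
int32_t combine_bytes(const unsigned char bytes[4]);

Link the two objects together (without LTO) and every call has to do the real work on whatever data the caller passes in.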
Also, gcc treats main as a "cold" function and doesn't optimize it quite as heavily as normal functions. In most programs, main doesn't contain the inner loop.
Why is the -O0 asm so slow?

Obviously the code is horrible from -O0, storing to memory and even incrementing the loop counter in memory. It's still somewhat interesting to figure out why it's running even slower than I'd expect, with CODE1 at under one insn per clock.
You didn't show the whole loop for either piece of code. Probably removing the loop body would still leave a slow loop. I think the loop itself is the problem, and is so inefficient that there's time for the CPU to do all the extra instructions in CODE2 without seeing any slowdown.
TL;DR: both loops are bottlenecked by the add $1, -0x24(%rbp), which increments the loop counter in memory. Its 6-cycle latency in a loop-carried dependency chain explains why both loops are bottlenecked to the same speed.
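Spelled out, the loop-carried chain that sets that 6-cycle floor is just the counter's read-modify-write (a restatement of the point above, not a new measurement):

iteration N:    load -0x24(%rbp)  ->  add $1  ->  store -0x24(%rbp)
iteration N+1:  load -0x24(%rbp)  must wait for the store above (store-forwarding)

That load + add + store-forward round trip is the 6-cycle memory-destination add latency from Agner Fog's tables, so the increments can't happen faster than one per 6 clocks no matter what else is in the loop body.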
I don't know exactly why the extra instructions in CODE2 somehow help get closer to the 6 cycles per iteration theoretical max, but this isn't a bottleneck that should ever come up in code that anyone would ever write. Keep your loop counters in registers, and don't include a read-modify-write of the same address in a loop-carried dependency chain. (Incrementing memory at a different address each iteration is fine, e.g. for CountingSort; see the sketch just below.)
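A sketch of that "different address each iteration" case, CountingSort-style (illustrative code, not taken from the answer):

#include <stddef.h>

/* Histogram pass of a counting sort (assumes counts[] was zeroed by the
   caller): there is a memory read-modify-write every iteration, but usually
   at a different address each time, so successive iterations rarely
   serialize on store-forwarding the way a memory-resident loop counter does. */
void histogram(const unsigned char *data, size_t n, unsigned counts[256]) {
    for (size_t i = 0; i < n; ++i)
        counts[data[i]]++;
}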
See godbolt for the changes I made to the code. (With ITERATIONS bumped up by a factor of 100 so runtime dominates the noise of startup overhead.) That link has optimization enabled, for the benefit of the first 3 functions.
godbolt doesn't have a C mode, only C++, and I got a less-bad loop from gcc 4.9.2 locally than godbolt shows. (g++ implements a for(){} loop exactly as written, with a cmp/jcc at the top and a jmp at the bottom. gcc, even at -O0, uses a do{} while(count++ < ITERATIONS); structure, with just a cmp/jle at the bottom.)
"I do not know any real assembly, but can read GCC -S output to evaluate actual costs of given C code."

"Well, used to think like: 'yeah, some operations like MUL are pretty expensive, but if one assembly is X times bigger than another, it should be slower'."
The first thing to look for is throughput and latency bottlenecks. On Intel, that means 4 uops per clock of issue throughput, or less if a long dependency chain is the limit. Then there are per-execution-port bottlenecks, e.g. two memory ops per clock, with at most one of them being a store, and at most one mul per clock, because only one of the 3 ALU ports has an integer multiplier.
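As a toy illustration of a per-port bottleneck (a hypothetical example, not from the question's code):

#include <stdint.h>
#include <stddef.h>

/* Three independent multiplies per iteration: they don't depend on each
   other, but they all compete for the single integer-multiply port, so on
   SnB/HSW this loop can't run much faster than ~3 cycles per iteration even
   though the frontend could issue far more uops than that. */
uint64_t mul_bound(uint64_t a, uint64_t b, uint64_t c, size_t n) {
    uint64_t x = 1, y = 1, z = 1;
    for (size_t i = 0; i < n; ++i) {
        x *= a;
        y *= b;
        z *= c;
    }
    return x ^ y ^ z;
}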
See Agner Fog's site for optimization guides, micro-architecture docs, and tables of instruction latency/throughputs / uops / ports they can run on.
Your loops are badly bottlenecked by keeping the loop counter in memory. On SandyBridge (my system) and Haswell (yours), Agner Fog's table has the latency of add with a memory destination at 6 clocks. There's no way it can run any faster than one iteration per 6 clocks. With 6 instructions in CODE1's loop, that's 1 insn per cycle.
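To put numbers on both loops' ceilings (simple arithmetic from the instruction counts in the perf listings below):

CODE1: 6 instructions / 6 cycles per iteration = 1.0 insn per clock at the bottleneck.
CODE2: 18 instructions / 6 cycles per iteration = 3.0 insns per clock at the bottleneck.

(The perf stats further down show 0.84 and 2.58 insns per cycle respectively, so both loops run a bit below those ceilings.)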
In practice, I'm getting less throughput than that. Maybe the store that's part of add's read-modify-write operation is sometimes delayed by the other loads/stores in the loop. IDK why CODE2 is slightly faster; that is strange. Maybe it orders things differently so the loop-carried-dependency add is delayed less often.
The loop body using lea and a 32-bit load is obviously faster. IDK why you think it's the lea that's slow.
It's not an alignment / uop-cache issue. The loop should stream from the loop buffer either way, even if there were more than 18 uops in a 32B block of code (meaning it couldn't go in the uop cache). Frontend bottlenecks (other than branch mispredicts, which we don't have) can't be an issue when our insns per clock are so low. The frontend can easily keep a big supply of uops queued up for the scheduler to dispatch.
From perf report, profiling clocks taken on each instruction: CODE1's inner loop. Counts aren't cycle-accurate. We're probably seeing the CPU stuck on the instructions right after the add $1, mem, which I'm sure is the bottleneck loop-carried dependency. It needs to forward the store to the load on the next iteration, which still takes 6 clocks.
###### CODE1 inner loop, profiling on cycles
13.97 │400680: lea -0x10(%rbp),%rax
│400684: mov (%rax),%eax
│400686: mov %eax,-0x2c(%rbp)
│400689: addl $0x1,-0x24(%rbp)
13.84 │40068d: cmpl $0x89543f,-0x24(%rbp)
72.19 │400694: ↑ jle 400680 <code1+0x4a>
## end of the loop
400696: callq 4004e0 <clock@plt>
40069b: sub -0x18(%rbp),%rax
#CODE2
15.08 │400738: movzbl -0xd(%rbp),%eax
0.88 │40073c: movzbl %al,%eax
0.05 │40073f: movzbl -0xe(%rbp),%edx
│400743: movzbl %dl,%edx
13.91 │400746: shl $0x8,%edx
0.70 │400749: or %eax,%edx
0.05 │40074b: movzbl -0xf(%rbp),%eax
│40074f: movzbl %al,%eax
12.70 │400752: shl $0x10,%eax
0.60 │400755: or %eax,%edx
0.09 │400757: movzbl -0x10(%rbp),%eax
0.05 │40075b: movzbl %al,%eax
13.03 │40075e: shl $0x18,%eax
0.70 │400761: or %edx,%eax
0.14 │400763: mov %eax,-0x2c(%rbp)
0.09 │400766: addl $0x1,-0x24(%rbp)
13.63 │40076a: cmpl $0x89543f,-0x24(%rbp)
28.29 │400771: ↑ jle 400738 <code2+0x4a>
## end of the loop
400773: → callq 4004e0 <clock@plt>
400778: sub -0x18(%rbp),%rax
Wow, that's pretty hilarious: gcc does a redundant movzbl %al, %eax after loading %eax from an 8-bit memory location with movzbl.
So in 6 clocks per iteration, can the CPU handle all that busy-work of loading and combining bytes? Yes:
movzx reg, mem: 4 load-port uops (p2/p3).
movzx reg, reg: 4 uops for any ALU port (p015).
shl reg, imm: 3 uops for ALU ports p0/p5.
or reg, reg: 3 uops for any ALU port (p015).
mov mem, reg: 1 fused-domain uop: 1 store-data (p4), 1 store-address (p23).
add mem, imm: 2 fused-domain uops. Unfused: 1 ALU uop (p015), 1 load uop (p23), 1 store-data (p4), 1 store-address (p23).
cmp mem, imm: 1 uop for p015, 1 for p23.
jle: 1 uop for p5 (can't macro-fuse with the cmp, because of the imm and mem operands).

Total fused-domain uops: 4+4+3+3+1+2+1+1 = 19. That fits in the 28-uop loop stream buffer, avoiding any possibility of uop-cache bottlenecks, and can issue in 5 clocks (at 4 per cycle, with the last cycle issuing only 3).
load uops: 4 + 1 + 1 = 6. store uops: 2.
ALU uops: 4+3+3+1+1+1 = 13. SnB's 3 ALU uop ports can handle that in 4.33 clocks. Most of the uops can run on any port, so no one port is a bottleneck. (Haswell has a 4th ALU port, p6. It has an even easier time. But ALU uops aren't the bottleneck.)
The latency of the loop bodies doesn't matter, because the next iteration doesn't read any result. Each iteration reads some data and stores it, independently of what the previous iteration did. Many loops are like this. It's usual for such loops to load from and store to a different address each time, but the CPU just does what it's told.
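A sketch of that usual shape, using the question's byte-combining as the loop body (illustrative only, not from the answer):

#include <stdint.h>
#include <stddef.h>

/* Each iteration loads 4 bytes and stores one result to a different address;
   no iteration reads anything the previous one wrote, so the CPU can overlap
   work from many iterations in flight. */
void combine_all(uint32_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint32_t)src[4*i + 3]
               | ((uint32_t)src[4*i + 2] << 8)
               | ((uint32_t)src[4*i + 1] << 16)
               | ((uint32_t)src[4*i + 0] << 24);
}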
Anyway, even if the dependency chain within each loop takes more than 6 clocks, work from multiple iterations can be in flight. Nothing in one iteration has to wait for the previous, except the loop-counter increment with a memory destination.
So all that work in the CODE2 loop is not a bottleneck at all.
For SnB/HSW, add-immediate with a memory destination is 2 uops, while inc on a memory destination is 3, according to Agner Fog's table, which is strange. I wonder if that's an error, or if Intel CPUs really are slower when using inc on a memory destination instead of add $1?
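If someone wanted to test that directly, a GNU C inline-asm sketch like this (hypothetical, not something from the answer) would pin down exactly which instruction sits in the memory-destination dependency chain:

#include <stdint.h>

/* Time each of these with the same iteration count; the "+m" constraint
   forces a genuine memory read-modify-write on *p every iteration. */
void bump_with_inc(uint32_t *p, long n) {
    for (long i = 0; i < n; ++i)
        __asm__ volatile ("incl %0" : "+m"(*p));
}

void bump_with_add(uint32_t *p, long n) {
    for (long i = 0; i < n; ++i)
        __asm__ volatile ("addl $1, %0" : "+m"(*p));
}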
Test timings (from gcc 4.9.2). I don't see anything obvious that would explain why CODE2 gets closer to the theoretical max of one iteration per 6 clocks. My only guess is that CODE1 is confused by the call right after the jle, but CODE2 isn't? Maybe a perf record on uops...
Sandybridge i5 (2500k):
## CODE1 ##
Performance counter stats for './a.out 1' (4 runs):
589.117019 task-clock (msec) # 0.999 CPUs utilized ( +- 1.30% )
2,068,118,847 cycles # 3.511 GHz ( +- 0.48% )
1,729,505,746 instructions # 0.84 insns per cycle
# 0.86 stalled cycles per insn ( +- 0.01% )
2,018,533,639 uops_issued_any # 3426.371 M/sec ( +- 0.02% )
5,648,003,370 uops_dispatched_thread # 9587.235 M/sec ( +- 2.51% )
3,170,497,726 uops_retired_all # 5381.779 M/sec ( +- 0.01% )
2,018,126,488 uops_retired_retire_slots # 3425.680 M/sec ( +- 0.01% )
1,491,267,343 stalled-cycles-frontend # 72.11% frontend cycles idle ( +- 0.66% )
27,192,780 stalled-cycles-backend # 1.31% backend cycles idle ( +- 68.75% )
0.589651097 seconds time elapsed ( +- 1.32% )
It's very unusual to see uops_dispatched_thread not matching uops_retired_all. They're usually both the same, and equal to the number of unfused uops for instructions in the loop. Fused-domain uops_issued_any and uops_retired_retire_slots are usually both equal, which they are in this case. Maybe memory-destination ALU ops count differently in dispatched vs. retired_all? (micro-fusion). I think my previous testing has only looked at micro-fusion of loads.
I don't think it's issuing uops that turn out not to be needed. (It's not a branch-mispredict issue; I checked, and both versions have 0.00% branch mispredicts. (only ~10k for 288M branches).)
## CODE2 ##
peter@tesla:~/src/SO$ ocperf.py stat -r4 -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./a.out 2
perf stat -r4 -e task-clock,cycles,instructions,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_dispatched_thread/,cpu/event=0xc2,umask=0x1,name=uops_retired_all/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,stalled-cycles-frontend,stalled-cycles-backend ./a.out 2
CODE2: ts: 16499
CODE2: ts: 16535
CODE2: ts: 16517
CODE2: ts: 16529
Performance counter stats for './a.out 2' (4 runs):
543.886795 task-clock (msec) # 0.999 CPUs utilized ( +- 1.01% )
2,007,623,288 cycles # 3.691 GHz ( +- 0.01% )
5,185,222,133 instructions # 2.58 insns per cycle
# 0.11 stalled cycles per insn ( +- 0.00% )
5,474,164,892 uops_issued_any # 10064.898 M/sec ( +- 0.00% )
7,586,159,182 uops_dispatched_thread # 13948.048 M/sec ( +- 0.00% )
6,626,044,316 uops_retired_all # 12182.764 M/sec ( +- 0.00% )
5,473,742,754 uops_retired_retire_slots # 10064.121 M/sec ( +- 0.00% )
566,873,826 stalled-cycles-frontend # 28.24% frontend cycles idle ( +- 0.03% )
3,744,569 stalled-cycles-backend # 0.19% backend cycles idle ( +- 2.32% )
0.544190971 seconds time elapsed ( +- 1.01% )
In the fused-domain, issued & retired_slots match for CODE2.
Upvotes: 6