Iskander Sharipov

Reputation: 275

Is leaq THAT slow, or is there another reason the smaller assembly listing is slower than the longer one?

I do not know any real assembly, but I can read GCC -S output to evaluate the actual cost of given C code.

This question is not so much about profiling and benchmarks as it is educational. I need someone to explain to me why the first snippet is not faster than the second one.

Well, I used to think along the lines of: "yeah, some operations like MUL are pretty expensive, but if one assembly listing is X times bigger than another, it should be slower".

That held up until I met these two:

unsigned char bytes[4] = {0, 0, 0, 5};

// 1
int32_t val = *((int32_t*)bytes);      
/* produces:
        leaq    -16(%rbp), %rax
        movl    (%rax), %eax
        movl    %eax, -4(%rbp)
        movl    $0, %eax
*/

// 2   
val = bytes[3] |                               
      (bytes[2] << 8) |                        
      (bytes[1] << 16) |
      (bytes[0] << 24);
/* produces: 
        movzbl  -13(%rbp), %eax
        movzbl  %al, %eax
        movzbl  -14(%rbp), %edx
        movzbl  %dl, %edx
        sall    $8, %edx
        orl     %eax, %edx
        movzbl  -15(%rbp), %eax
        movzbl  %al, %eax
        sall    $16, %eax
        orl     %eax, %edx
        movzbl  -16(%rbp), %eax
        movzbl  %al, %eax
        sall    $24, %eax
        orl     %edx, %eax
        movl    %eax, -4(%rbp)
        movl    $0, %eax
*/

And benchmarks show that the second one is 5-10% faster. What is going on here?

The only significant difference, and the only "reason" I can imagine, is that LEAQ is something very slow. The last 2 lines are identical, so maybe the price of MOV is so high that 1 extra MOV is worse than a ton of instructions.

Here is what I used to measure execution times:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

#define REPETITIONS 32
#define ITERATIONS 90000

#define CODE1                   \
  for (int i = 0; i < ITERATIONS; ++i) {    \
    val = *((int32_t*)bytes);           \
  }

#define CODE2                   \
  for (int i = 0; i < ITERATIONS; ++i) {    \
    val = bytes[3] |                \
      (bytes[2] << 8) |             \
      (bytes[1] << 16) |            \
      (bytes[0] << 24);             \
  }

int main(void) {
  clock_t minTs = 999999999, ts;            // keep the best (lowest) time
  unsigned char bytes[4] = {0, 0, 0, 5};
  int32_t val;

  // Repeat the timed loop and keep only the fastest run.
  for (int i = 0; i < REPETITIONS; ++i) {
    ts = clock();

    CODE1; // Or CODE2

    ts = clock() - ts;
    if (ts < minTs) minTs = ts;
  }

  printf("ts: %ld\n", minTs);

  return val;
}


Update: as it turns out, the results are hardware-specific: while the first snippet is slower on my laptop (x64 i5-4260U), it is faster on my PC (but only by a small margin, around 5%).

Upvotes: 1

Views: 703

Answers (1)

Peter Cordes

Reputation: 366016

Update: extra loads/stores can reduce store-forwarding latency on Sandybridge-family CPUs, so the extra work in CODE2 probably sped up the loop-carried dependency chain that both loops bottleneck on (because -O0 keeps the loop counter in memory).

This effect is present in all SnB-family CPUs, including the Sandybridge I had when writing this answer, the OP's Haswell, the Broadwell in Adding a redundant assignment speeds up code when compiled without optimization, and the Skylake in Loop with function call faster than an empty loop.


What you should have done

If you're curious about how your code compiles to asm, put your code in a function where it can't get optimized away. On this godbolt link, you can see how gcc 5.1 and newer (at -O3 -march=native -mtune=native) sees through the ORing together of bytes and uses movbe (move big-endian) to bswap on the fly while loading. icc, clang, and older gcc emit instructions that load individual bytes and shift / OR them into place.

I was disappointed that compilers didn't see through the ORing of bytes even when I reversed the order to do a little-endian load instead of a big-endian load. (see the native, big, and little endian functions on godbolt.) Even changing the types to uint32_t and uint8_t didn't help for compilers other than gcc >= 5.1.
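
To make that concrete, here's a minimal sketch (the name load_be32 is mine, not from the question) of the kind of stand-alone function worth feeding to the compiler: the bytes arrive through a pointer argument so nothing can be constant-folded, and gcc 5.1+ at -O3 reportedly turns the shift/OR pattern into a movbe (or mov + bswap) load:

#include <stdint.h>

// Hypothetical stand-alone version of the byte-combining code from the
// question. gcc 5.1 and newer can recognize this pattern and load the
// whole word with movbe / bswap instead of four byte loads + shifts + ORs.
int32_t load_be32(const uint8_t *bytes) {
  return (int32_t)(((uint32_t)bytes[0] << 24) |
                   ((uint32_t)bytes[1] << 16) |
                   ((uint32_t)bytes[2] <<  8) |
                    (uint32_t)bytes[3]);
}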

Obviously with optimization on, compilers throw away the loops that just set an unused variable. They just call clock twice, printf, and then mov-immediate the answer into eax. If you want to actually benchmark something, compile it separately from the function that will call it with constant data. That way, you can have simple code that takes its work as function arguments, and it can't get inlined into the caller that passes it constant data.
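
A sketch of what that separation could look like (the file split and names are mine): the timing loop lives in one translation unit and calls load_be32() from another, so without LTO nothing can be inlined or constant-propagated into the caller:

// main.c (sketch): load_be32() is compiled separately, so its constant
// input can't be folded into the result at compile time.
#include <stdio.h>
#include <time.h>
#include <stdint.h>

int32_t load_be32(const uint8_t *bytes);  // defined in the other .c file

int main(void) {
  uint8_t bytes[4] = {0, 0, 0, 5};
  volatile int32_t sink;                  // keeps each result "used"
  clock_t ts = clock();
  for (int i = 0; i < 9000000; ++i)
    sink = load_be32(bytes);
  ts = clock() - ts;
  printf("ts: %ld\n", (long)ts);
  return 0;
}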

Also, gcc treats main as a "cold" function, and doesn't optimize it quite as heavily as normal functions. In most programs, main doesn't contain the inner loop.


Why is the -O0 asm so slow?

Obviously the code from -O0 is horrible, storing to memory and even incrementing the loop counter in memory. It's still somewhat interesting to figure out why it's running even slower than I'd expect, with CODE1 at under one insn per clock.

You didn't show the whole loop for either piece of code. Probably removing the loop body would still leave a slow loop. I think the loop itself is the problem, and is so inefficient that there's time for the CPU to do all the extra instructions in CODE2 without seeing any slowdown.


TL;DR: both loops are bottlenecked by the add $1, -0x24(%rbp), which increments the loop counter in memory. 6 cycle latency in a loop-carried dependency chain explains why both loops are bottlenecked to the same speed.

I don't know exactly why the extra instructions in CODE2 somehow help get closer to the 6 cycles per iteration theoretical max, but this isn't a bottleneck that should ever come up in code that anyone would ever write. Keep your loop counters in registers, and don't include a read-modify-write instruction on the same address in a loop-carried dependency chain. (Incrementing memory at different addresses each iteration is fine, e.g. for CountingSort; see the sketch below.)
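
As a contrast with the loop counter, here's an illustrative sketch in the spirit of the counting-sort case just mentioned (the function is mine, not from the question): it also does an add-to-memory on every iteration, but usually at a different address each time, so consecutive increments don't chain into one long store-forward/reload dependency:

#include <stddef.h>
#include <stdint.h>

// Counting pass of a counting sort (illustrative). counts[] is incremented
// in memory every iteration, but normally at a different address each time,
// so there's no single 6-cycle store/reload chain like the one the -O0
// loop counter creates.
void count_bytes(const uint8_t *data, size_t n, uint32_t counts[256]) {
  for (size_t i = 0; i < n; ++i)
    counts[data[i]]++;
}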


See godbolt for the changes I made to the code. (With ITERATIONS bumped up by a factor of 100 so runtime dominates the noise of startup overhead.) That link has optimization enabled, for the benefit of the first 3 functions.

godbolt doesn't have a C mode, only C++, and I got a less-bad loop from gcc 4.9.2 locally than godbolt shows. (g++ implements a for(){} loop exactly as written, with a cmp/jcc at the top and a jmp at the bottom. gcc even at -O0 uses a do{} while(count++ < ITERATIONS); structure, with just a cmp/jle at the bottom.)
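
In C terms, that -O0 loop shape is roughly the following (a sketch of the structure, not actual compiler output; 9000000 is ITERATIONS after the 100x bump):

#include <stdint.h>

#define ITERATIONS 9000000  // 90000 * 100, matching the bumped-up count

// Rough C equivalent of the loop shape gcc emits at -O0 for CODE1: no test
// at the top of the loop, a single cmpl/jle at the bottom, and (at -O0)
// both i and val living in memory on the stack.
int32_t code1_shape(const unsigned char bytes[4]) {
  int32_t val = 0;
  int i = 0;
  do {
    val = *((const int32_t *)bytes);  // the 32-bit load from CODE1
  } while (++i < ITERATIONS);
  return val;
}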

I do not know any real assembly, but I can read GCC -S output to evaluate the actual cost of given C code.

Well, I used to think along the lines of: "yeah, some operations like MUL are pretty expensive, but if one assembly listing is X times bigger than another, it should be slower".

The first thing to look for is throughput and latency bottlenecks. On Intel, that means 4 uops per clock of issue throughput, or less if a long dependency chain is the limit. Then there are per-execution-port bottlenecks, e.g. two memory ops per clock with at most one of them being a store, and at most one mul per clock, because only one of the 3 ALU ports has an integer multiplier.

See Agner Fog's site for optimization guides, micro-architecture docs, and tables of instruction latency/throughputs / uops / ports they can run on.

Your loops are badly bottlenecked by keeping the loop counter in memory. On SandyBridge (my system) and Haswell (yours), Agner Fog's table has the latency of add with a memory destination at 6 clocks. There's no way it can run any faster than one iteration per 6 clocks. With 6 instructions in the loop, that's 1 insn per cycle.

In practice, I'm getting less throughput than that. Maybe the store as part of add's read-modify-write operation is sometimes delayed by the other loads/stores in the loop. IDK why CODE2 is slightly faster, that is strange. Maybe it orders things differently so the loop-carried-dependency add is delayed less often.

The loop body using lea and a 32bit load is obviously faster. IDK why you think it's the lea that's slow.

It's not an alignment / uop-cache issue. The loop should stream from the loop buffer either way, even if there were more than 18 uops in a 32B block of code (which would mean it couldn't go in the uop cache). Frontend bottlenecks (other than branch mispredicts, which we don't have) can't be an issue when our insns per clock are so low. The frontend can easily keep a big supply of uops queued up for the scheduler to dispatch.

From perf report, profiling which instructions the clock cycles get charged to, here is CODE1's inner loop. The counts aren't cycle-accurate. We're probably seeing the CPU stuck on the instructions right after the add $1, mem, which I'm sure is the bottlenecking loop-carried dependency. It needs to forward the store to the load on the next iteration, which still takes 6 clocks.

   ###### CODE1 inner loop, profiling on cycles
 13.97 │400680:   lea    -0x10(%rbp),%rax
       │400684:   mov    (%rax),%eax
       │400686:   mov    %eax,-0x2c(%rbp)
       │400689:   addl   $0x1,-0x24(%rbp)
 13.84 │40068d:   cmpl   $0x89543f,-0x24(%rbp)
 72.19 │400694: ↑ jle    400680 <code1+0x4a>
       ## end of the loop
        400696:   callq  4004e0 <clock@plt>
        40069b:   sub    -0x18(%rbp),%rax

       #CODE2
 15.08 │400738:   movzbl -0xd(%rbp),%eax
  0.88 │40073c:   movzbl %al,%eax
  0.05 │40073f:   movzbl -0xe(%rbp),%edx
       │400743:   movzbl %dl,%edx
 13.91 │400746:   shl    $0x8,%edx
  0.70 │400749:   or     %eax,%edx
  0.05 │40074b:   movzbl -0xf(%rbp),%eax
       │40074f:   movzbl %al,%eax
 12.70 │400752:   shl    $0x10,%eax
  0.60 │400755:   or     %eax,%edx
  0.09 │400757:   movzbl -0x10(%rbp),%eax
  0.05 │40075b:   movzbl %al,%eax
 13.03 │40075e:   shl    $0x18,%eax
  0.70 │400761:   or     %edx,%eax
  0.14 │400763:   mov    %eax,-0x2c(%rbp)
  0.09 │400766:   addl   $0x1,-0x24(%rbp)
 13.63 │40076a:   cmpl   $0x89543f,-0x24(%rbp)
 28.29 │400771: ↑ jle    400738 <code2+0x4a>
     ## end of the loop
        400773: → callq  4004e0 <clock@plt>
        400778:   sub    -0x18(%rbp),%rax 

Wow, that's pretty hilarious. gcc does a redundant movzbl %al, %eax after loading %eax from an 8bit memory location with movzbl.

So in 6 clocks per iteration, can the CPU handle all that busy-work of loading and combining bytes? Yes.

  • 4x movzx reg, mem: 4 load-port uops. (p2/p3)
  • 4x movzx reg, reg: 4 uops for any ALU port (p015)
  • 3x shl reg, imm: 3 uops for ALU ports p0/p5
  • 3x or reg, reg: 3 uops for any ALU port (p015)
  • 1x mov mem, reg: 1 uop fused-domain: 1 store-data (p4), 1 store-address (p23)
  • 1x add mem, imm: 2 fused-domain. unfused: 1 ALU uop (p015), 1 load uop (p23), 1 store-data (p4), 1 store-address (p23)
  • 1x cmp mem, imm: 1 uop for p015, 1 for p23.
  • 1x jle: 1 uop for p5. (can't macro-fuse with the cmp, because of imm and mem)

total fused-domain uops: 4+4+3+3+1+2+1+1 = 19. That fits in the 28-uop loop stream buffer, avoiding any possibility of uop-cache bottlenecks, and can issue in 5 clocks (at 4 per cycle, with the last cycle issuing only 3).

load uops: 4 + 1 + 1 = 6. store uops: 2.

ALU uops: 4+3+3+1+1+1 = 13. SnB's 3 ALU uop ports can handle that in 4.33 clocks. Most of the uops can run on any port, so no one port is a bottleneck. (Haswell has a 4th ALU port, p6. It has an even easier time. But ALU uops aren't the bottleneck.)

The latency of the loop bodies doesn't matter, because the next iteration doesn't read any result. Each iteration reads some data and stores it, independently of what the previous iteration did. Many loops are like this. It's usual for such loops to load from and store to a different address each time, but the CPU just does what it's told.

Anyway, even if the dependency chain within each loop takes more than 6 clocks, work from multiple iterations can be in flight. Nothing in one iteration has to wait for the previous, except the loop-counter increment with a memory destination.

So all that work in the CODE2 loop is not a bottleneck at all.

For SnB/HSW, add-immediate with a memory destination is 2 uops, while inc on a memory destination is 3, according to Agner Fog's table, which is strange. I wonder if that's an error, or if Intel CPUs really are slower when using inc on a memory destination instead of add $1?


Test timings (from gcc 4.9.2). I don't see anything obvious that would explain why CODE2 gets closer to the theoretical max of one iteration per 6 clocks. My only guess is that CODE1 is thrown off by the call instruction right after the jle, while CODE2 isn't. Maybe a perf record on uop events would shed some light.

Sandybridge i5 (2500k):

## CODE1 ##
 Performance counter stats for './a.out 1' (4 runs):

        589.117019      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.30% )
     2,068,118,847      cycles                    #    3.511 GHz                      ( +-  0.48% )
     1,729,505,746      instructions              #    0.84  insns per cycle        
                                                  #    0.86  stalled cycles per insn  ( +-  0.01% )
     2,018,533,639      uops_issued_any           # 3426.371 M/sec                    ( +-  0.02% )
     5,648,003,370      uops_dispatched_thread    # 9587.235 M/sec                    ( +-  2.51% )
     3,170,497,726      uops_retired_all          # 5381.779 M/sec                    ( +-  0.01% )
     2,018,126,488      uops_retired_retire_slots # 3425.680 M/sec                    ( +-  0.01% )
     1,491,267,343      stalled-cycles-frontend   #   72.11% frontend cycles idle     ( +-  0.66% )
        27,192,780      stalled-cycles-backend    #    1.31% backend  cycles idle     ( +- 68.75% )

       0.589651097 seconds time elapsed                                          ( +-  1.32% )

It's very unusual to see uops_dispatched_thread not matching uops_retired_all. They're usually both the same, and equal to the number of unfused uops for instructions in the loop. Fused-domain uops_issued_any and uops_retired_retire_slots are usually both equal, which they are in this case. Maybe memory-destination ALU ops count differently in dispatched vs. retired_all? (micro-fusion). I think my previous testing has only looked at micro-fusion of loads.

I don't think it's issuing uops that turn out not to be needed. (It's not a branch-mispredict issue; I checked, and both versions have 0.00% branch mispredicts, only ~10k mispredicts for 288M branches.)

## CODE2 ##
peter@tesla:~/src/SO$ ocperf.py stat -r4 -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./a.out 2
perf stat -r4 -e task-clock,cycles,instructions,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_dispatched_thread/,cpu/event=0xc2,umask=0x1,name=uops_retired_all/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,stalled-cycles-frontend,stalled-cycles-backend ./a.out 2
CODE2: ts: 16499
CODE2: ts: 16535
CODE2: ts: 16517
CODE2: ts: 16529

 Performance counter stats for './a.out 2' (4 runs):

        543.886795      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.01% )
     2,007,623,288      cycles                    #    3.691 GHz                      ( +-  0.01% )
     5,185,222,133      instructions              #    2.58  insns per cycle        
                                                  #    0.11  stalled cycles per insn  ( +-  0.00% )
     5,474,164,892      uops_issued_any           # 10064.898 M/sec                    ( +-  0.00% )
     7,586,159,182      uops_dispatched_thread    # 13948.048 M/sec                    ( +-  0.00% )
     6,626,044,316      uops_retired_all          # 12182.764 M/sec                    ( +-  0.00% )
     5,473,742,754      uops_retired_retire_slots # 10064.121 M/sec                    ( +-  0.00% )
       566,873,826      stalled-cycles-frontend   #   28.24% frontend cycles idle     ( +-  0.03% )
         3,744,569      stalled-cycles-backend    #    0.19% backend  cycles idle     ( +-  2.32% )

       0.544190971 seconds time elapsed                                          ( +-  1.01% )

In the fused-domain, issued & retired_slots match for CODE2.

Upvotes: 6
