einpoklum

Reputation: 132108

Are scaled-index addressing modes a good idea?

Consider the following code:

void foo(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}

This compiles (with maximum optimization, but no unrolling or vectorization) into...

GCC 7.2:

foo(int*):
        xor     eax, eax
.L2:
        mov     DWORD PTR [rdi], eax
        add     eax, 2
        add     rdi, 4
        cmp     eax, 200
        jne     .L2
        rep ret

clang 5.0:

foo(int*): # @foo(int*)
  xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
  mov dword ptr [rdi + 2*rax], eax
  add rax, 2
  cmp rax, 200
  jne .LBB0_1
  ret

What are the pros and cons of GCC's approach vs. clang's? i.e. incrementing an extra variable separately, vs. multiplying via a more complex addressing mode?


Upvotes: 6

Views: 1486

Answers (2)

Peter Cordes

Reputation: 365537

Often no, because in some cases Intel CPUs end up running them as more uops. Especially when unrolling, one extra pointer-increment instruction can save multiple uops in later instructions like vpaddb ymm0, ymm1, [rsi] (a single uop in the fused domain) vs. the same with [rsi + rcx*4]. On Haswell and later, instructions like add eax, [rsi + rcx*4] or an equivalent SSE2 paddd can stay micro-fused, but indexed stores still can't run on port 7 (until Ice Lake, whose dedicated store AGUs handle indexed addressing modes too).

Otherwise yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments. An indexed addressing mode does take an extra byte of code size.

(In many cases unrolling and using pointer increments is a win because of un-lamination on Intel Sandybridge-family, but if you're not unrolling, or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes often break even on some CPUs and are a win on others.)
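To make the un-lamination cases concrete, here's a sketch (register choices are arbitrary, not from the question's code) of the same memory-source ALU op written three ways, with fused-domain uop counts on Intel:

vpaddd  ymm0, ymm0, [rsi + rcx*4] ; VEX + indexed: un-laminates to 2 uops
                                  ; on Sandybridge through Skylake
vpaddd  ymm0, ymm0, [rsi]         ; VEX + base-only: stays 1 micro-fused uop
paddd   xmm0, [rsi + rcx*4]       ; non-VEX 2-operand: 2 uops on SnB/IvB,
                                  ; but stays micro-fused (1 uop) on Haswell+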

It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)

One trick is to index one array relative to the other, like so for AVX2.

add_arrays:    ; int *a in RDI,  const int *b in RSI, int *enda in RDX
               ; a[i] += b[i]
   ; maybe sub rdx, 28 or something to make the loop bound work
   sub  rsi, rdi        ; b-a
 .loop:
   vmovdqu  ymm0, [rsi + rdi]    ; b[i]  as aptr + (b-a)
   vpaddd   ymm0, ymm0, [rdi]    ; a[i]  as aptr
   vmovdqu  [rdi], ymm0
   add      rdi, 32
   cmp      rdi, rdx
   jb    .loop                  ; }while(aptr < enda)

... partial vector cleanup, etc.

Notice how the only instruction using an indexed addressing mode is the vmovdqu load, which decodes to a single uop regardless of addressing mode: pure loads don't un-laminate, and they don't compete for the port-7 store AGU.

If indexed addressing modes weren't a problem at all, another trick is to do add rdi, rdx / add rsi, rdx / neg rdx to index relative to the ends of the arrays, counting up toward zero, so the loop overhead can be just add rdx, 8 / jl (or jnz if the count is an exact multiple of the vector width), stopping at or past zero. But then all the addressing modes would have to be [rsi + rdx*4] or [rdi + rdx*4].
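A minimal sketch of that end-relative structure (here using lea to scale an element count into byte offsets; assumes rdx holds a non-zero multiple of 8, and the label is hypothetical):

add_end_relative:            ; int *a in RDI, const int *b in RSI, count in RDX
   lea   rdi, [rdi + rdx*4]  ; point one-past-the-end of a
   lea   rsi, [rsi + rdx*4]  ; point one-past-the-end of b
   neg   rdx                 ; rdx = -count, counts up toward zero
 .loop:
   vmovdqu ymm0, [rsi + rdx*4]
   vpaddd  ymm0, ymm0, [rdi + rdx*4] ; indexed: un-laminates on Intel SnB-family
   vmovdqu [rdi + rdx*4], ymm0       ; indexed store: can't use port 7 on HSW/SKL
   add     rdx, 8            ; 8 dwords per YMM vector
   jnz     .loop             ; macro-fuses with the add on Intel
   ret

This buys the 1-uop loop overhead, but every memory access pays the indexed-mode costs: exactly the trade-off in question.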

(See also How does SIMD (avx) processing work? for example, if I want 10 32 bit floats how do i fit in a 256 bit avx vector? for getting the loop bounds and cleanup correct for SIMD loops.)


Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.

At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
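For example (byte counts from the standard x86-64 encoding; register choices arbitrary):

mov  eax, [rdi]           ; 8B 07    = 2 bytes, no SIB byte
mov  eax, [rdi + 4]       ; 8B 47 04 = 3 bytes, disp8
mov  eax, [rdi + rcx*4]   ; 8B 04 8F = 3 bytes, SIB byte for the index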

More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.

It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
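For instance, a hedged sketch of that load/modify/store case (hypothetical registers and operation):

lea   rcx, [rdi + rax*4]  ; pay one uop once to materialize the address
mov   edx, [rcx]          ; load: simple base-only mode, minimum latency
add   edx, 5              ; modify (arbitrary example operation)
mov   [rcx], edx          ; store: base-only, so it can use port 7 on HSW+

Using [rdi + rax*4] directly in both memory instructions would save the lea but pay the indexed-mode costs twice.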


Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's tables list AMD Ryzen as having 2c latency and only 2-per-clock throughput for lea with scaled-index (or 3-component) addressing modes, vs. 1c latency and 0.25c throughput otherwise. (e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth spending extra instructions to avoid it.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But even the worst-case LEA is not bad at all. And anyway, this is mostly unrelated to the address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not on the AGUs used for actual loads/stores.
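In Agner Fog's terms, the fast and slow Ryzen lea cases look like:

lea  rax, [rcx + rdx]     ; 2 components, no scale: 1c latency, 0.25c throughput
lea  rax, [rcx + rdx*2]   ; scaled index: 2c latency, 2/clock on Ryzen
lea  rax, [rcx + rdx + 8] ; 3 components: also the slow case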

The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
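e.g. a load from a hypothetical static table:

mov  eax, [table + rcx*4] ; [disp32 + idx*scale]: no base register, but it still
                          ; counts as indexed for Intel's un-lamination rules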


Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.

The only thing clang could do better here is save 2 bytes of code size by avoiding the REX prefixes on add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers, and presumably proved that the C loop doesn't need them to wrap, so it uses 64-bit operand-size everywhere. That's pointless: 32-bit operations are always at least as fast as 64-bit ones, and implicit zero-extension is free. But it only costs 2 bytes of code size, with no performance cost beyond indirect front-end effects from that.
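A sketch of what that would look like (same loop, only the operand-size of the scalar ops changed; add eax, 2 still zero-extends into rax, so the 64-bit addressing mode sees the right value):

foo(int*):
  xor eax, eax
.LBB0_1:
  mov dword ptr [rdi + 2*rax], eax
  add eax, 2                ; no REX prefix: 1 byte shorter
  cmp eax, 200              ; likewise 1 byte shorter
  jne .LBB0_1
  ret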

You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).

You also can't transform it to counting a negative index up towards zero (which compilers never do, but which reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs, since they can fuse add with jcc, while AMD can only fuse test or cmp with jcc).

Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)

gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
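Counting fused-domain uops (on Haswell, say) makes the comparison concrete:

; gcc's loop body:
;   mov [rdi], eax          -> 1 uop (micro-fused store-address + store-data)
;   add eax, 2              -> 1 uop
;   add rdi, 4              -> 1 uop
;   cmp eax, 200 / jne      -> 1 uop (macro-fused)
;   total: 4 uops per iteration
; clang's loop body:
;   mov [rdi + 2*rax], eax  -> 1 uop on HSW+ (2 on SnB/IvB after un-lamination)
;   add rax, 2              -> 1 uop
;   cmp rax, 200 / jne      -> 1 uop (macro-fused)
;   total: 3 uops per iteration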

Upvotes: 4

user555045

Reputation: 64913

Indexed addressing (in loads and stores; lea is different again) has some trade-offs, for example:

  • On many µarchs, instructions that use indexed addressing have slightly longer latency than instructions that don't. But usually throughput is the more important consideration.
  • On Netburst, stores with a SIB byte generate an extra µop, and may therefore cost throughput as well. It's the SIB byte itself that triggers the extra µop, so a store can pay it even without indexed addressing (e.g. [esp] forces a SIB byte), but indexed addressing always requires a SIB byte, so it always costs the extra µop. This doesn't apply to loads.
  • On Haswell/Broadwell (and still on Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation; instead, one of the more general address-generation ports (2 or 3) is used, reducing the throughput available for loads.

So for loads it's usually good (or at least not bad) to use indexed addressing if it saves an add somewhere, unless the loads are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference: saving the add isn't really relevant, since ALU instructions wouldn't be the bottleneck, and the address generation happening on ports 2 or 3 doesn't matter either since there are no loads.
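As a hypothetical illustration of the dependent-load caveat (a made-up pointer-chasing loop, not from the question's code):

.chase:
   mov  rdi, [rdi + rcx*8] ; each address depends on the previous load, so the
                           ; extra cycle of indexed load-use latency on Intel
                           ; SnB-family lengthens every trip around the chain
   test rdi, rdi
   jnz  .chase

In a throughput-bound loop, that same extra cycle would usually be hidden.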

Upvotes: 2
