Yale Zhang

Reputation: 1578

Does the loop variable type affect efficiency when used to index arrays?

I'm trying to optimize my code down to the last possible cycle, and I'm wondering whether the loop counter type affects performance when it is used for array indexing.

I've done some experiments with the following program that just fills an array with 0:

#include <stdlib.h>      // atoi
#include <stdint.h>      // int64_t
#include <emmintrin.h>   // SSE2 intrinsics

int main(int argc, char **argv)
{
  typedef int CounterType;
  //typedef int64_t CounterType;   // switch to this to test a 64-bit counter

  CounterType N = atoi(argv[1]);
  uint8_t volatile dummy[N + 16];
  __m128i v = _mm_set1_epi8(0);
  for (int j = 0; j < 1000000; ++j)
  {
    #pragma nounroll
    for (CounterType i = 0; i <= N; i += CounterType(16))
    {
        _mm_storeu_si128((__m128i *)&dummy[i], v);
    }
  }
  return 0;
}

By using different loop counter types (CounterType) and different compilers, I've recorded the assembly code of the inner loop and performance using hardware performance counters ("perf stat a.out 32768"). I'm running on a Xeon 5670.

GCC4.9, int

.L3:
movups  %xmm0, (%rax)
addq    $16, %rax
movl    %eax, %edx
subl    %esi, %edx
cmpl    %ecx, %edx
jle     .L3

 4,127,525,521      cycles                    #    2.934 GHz
12,304,723,292      instructions              #    2.98  insns per cycle

GCC4.9, int64

.L7:
movups  %xmm0, (%rcx,%rax)
addq    $16, %rax
cmpq    %rax, %rdx
jge     .L7
4,123,315,191      cycles                    #    2.934 GHz
8,206,745,195      instructions              #    1.99  insns per cycle

ICC11, int64

..B1.6:
movdqu    %xmm0, (%rdx,%rdi)
addq      $16, %rdx
incq      %rcx
cmpq      %rbx, %rcx
jb        ..B1.6        # Prob 82%                      #24.5
2,069,719,166      cycles                    #    2.934 GHz
5,130,061,268      instructions

(faster because of micro-op fusion?)

ICC11, int

..B1.6:                         # Preds ..B1.4 ..B1.6
 movdqu    %xmm0, (%rdx,%rbx)                            #29.38
 addq      $16, %rdx                                     #24.37
 cmpq      %rsi, %rdx                                    #24.34
 jle       ..B1.6        # Prob 82%                      #24.34
4,136,109,529      cycles                    #    2.934 GHz                
8,206,897,268      instructions    

ICC13, int & int64

movdqu    %xmm0, (%rdi,%rax)                            #29.38
addq      $16, %rdi                                     #24.37
cmpq      %rsi, %rdi                                    #24.34
jle       ..B1.7       
4,123,963,321      cycles                    #    2.934 GHz
8,206,083,789      instructions              #    1.99  insns per cycle

The data seems to suggest that int64 is faster. Maybe that's because it matches the pointer size and therefore avoids any conversions, but I'm not convinced of that conclusion. Another possibility is that in some cases the compiler decided to do the loop comparison before the store to gain more parallelism, at the cost of one extra instruction (since x86's two-operand instructions are destructive). But that would be incidental, not fundamentally caused by the loop variable type.

Can someone explain this mystery (preferably someone knowledgeable about compiler transformations)?

There was also a claim in the CUDA C Best Practices Guide that signed loop counters are easier to generate code for than unsigned ones. That doesn't seem relevant here, because there are no multiplies in the inner loop for address calculation; that expression gets turned into an induction variable. Apparently, though, the CUDA compiler prefers using multiply-add to compute addresses, since MADD is one instruction just like ADD and it can cut register use by one.
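To make the guide's point concrete, here is a minimal sketch of the kind of loop where the signed/unsigned choice matters (my own example, not from the guide; the function names are made up):

void scale_signed(float *out, const float *in, int n)
{
    for (int i = 0; i < n; ++i)
        out[i * 4] = in[i * 4];     // signed overflow is UB, so i*4 can be
                                    // strength-reduced to a pointer increment
}

void scale_unsigned(float *out, const float *in, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        out[i * 4] = in[i * 4];     // unsigned wraparound is defined, which can
                                    // block that transformation on 64-bit targets
}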

Upvotes: 5

Views: 294

Answers (3)

Z boson

Reputation: 33709

Yes, the loop variable type can affect efficiency.

Let me suggest an even better solution with GCC.

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

void distance(uint8_t* dummy, size_t n, const __m128 v0)
{
    // Count a negative index up towards zero, so the loop condition is just
    // "i < 0" and GCC can turn it into add + js with no separate cmp.
    intptr_t i;
    for(i = -(intptr_t)n; i < 0; i += 4) {
        _mm_store_ps(&((float*)dummy)[i + n], v0);
    }
}

With GCC 4.9.2 and GCC 5.3 this produces the following main loop:

.L5:
        vmovaps %xmm0, (%rdi,%rax)
        addq    $16, %rax
        js      .L5

Clang 3.6, however, still generates the cmp:

.LBB0_2:                                # =>This Inner Loop Header: 
        vmovaps %xmm0, 8(%rdi,%rax)
        addq    $4, %rax
        cmpq    $-4, %rax
        jl      .LBB0_2

and Clang 3.7 unrolls four times and uses a cmp.

ICC 13 unrolls twice and uses a cmp, so only GCC manages to do this without the unnecessary cmp instruction.
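The same trick carries over to the byte-filling loop from your question. A rough sketch (untested; the function name is mine), using the unaligned store from your code:

#include <stdint.h>
#include <emmintrin.h>

void fill_bytes(uint8_t *dummy, intptr_t n, __m128i v)
{
    // n is the byte count, assumed to be a positive multiple of 16.
    // The idea: the add on i sets the sign flag, so the loop branch
    // doesn't need a separate cmp.
    for (intptr_t i = -n; i < 0; i += 16)
        _mm_storeu_si128((__m128i *)&dummy[i + n], v);
}

Whether the cmp actually disappears again depends on the compiler version, so check the generated assembly.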

Upvotes: 2

Peter Cordes

Reputation: 366016

gcc 4.9.2 does a really poor job of compiling the version with an int loop counter. gcc 5.1 and later on godbolt make sane loops:

    call    strtol
    mov     edx, eax
   ...
    xor     eax, eax
.L7:
    movups  XMMWORD PTR [rcx+rax], xmm0
    add     rax, 16
    cmp     edx, eax
    jge     .L7        ; while(N >= (idx+=16))

This can run at one iteration per cycle on Intel CPUs (other than L1 cache miss bottlenecks), even if the store doesn't micro-fuse (since cmp/jge macro-fuses into a single uop).

I'm not sure why gcc 4.9.2 makes such a dumb loop. It decides it wants to increment a pointer, but then it subtracts the start address every time to compare against N, instead of computing an end address and using it as the loop condition. It computes i from its pointer into the array using 32-bit operations, which is actually safe since gcc only wants a 32-bit result: the upper 32 bits of the inputs wouldn't affect the low 32 bits of the result even if gcc had done 64-bit math.
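For reference, the end-address form I mean looks something like this in C (just a sketch; the fill wrapper is hypothetical and assumes dummy has at least N+16 bytes, like the buffer in the question):

#include <stdint.h>
#include <emmintrin.h>

static void fill(uint8_t *dummy, int N, __m128i v)
{
    uint8_t *p   = dummy;
    uint8_t *end = dummy + N;      // end address computed once, outside the loop
    do {                           // same i <= N bounds as the question's loop
        _mm_storeu_si128((__m128i *)p, v);
        p += 16;
    } while (p <= end);            // one pointer add plus a compare-and-branch per iteration
}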

Upvotes: 1

Survaf93

Reputation: 172

As far as I'm aware, loop types do not affect performance or execution speed; for optimization, the only things that count are:

  1. Don't let the loop run more than required
  2. Break out as soon as a specific condition is fulfilled

For filling a 2D array with numbers, if you do the above two, your execution complexity will be

(number of elements) * (number of commands within loop)

Each line in your loop counts as +1 to the number of commands.

That's optimization from the programming point of view; the only other way to make it faster is a better processor that can execute more commands per second, but that's up to the user.

EDIT:

Note that in some cases it's faster to use a pointer to the array and fill the elements in a single loop rather than having two loops. C allows a lot of variations of the same algorithm.
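A minimal sketch of that idea (my own example; the array dimensions are arbitrary):

#define ROWS 100
#define COLS 200

void fill_nested(int a[ROWS][COLS], int value)
{
    for (int r = 0; r < ROWS; ++r)            // two loops, one per dimension
        for (int c = 0; c < COLS; ++c)
            a[r][c] = value;
}

void fill_flat(int a[ROWS][COLS], int value)
{
    int *p = &a[0][0];                        // the rows are contiguous in memory
    for (int i = 0; i < ROWS * COLS; ++i)     // a single loop over the whole block
        p[i] = value;
}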

Upvotes: -1
