Reputation: 1578
I'm trying to optimize my code down to the last possible cycle, and am wondering if the loop type affects performance when used for array indexing?
I've done some experiments with the following program that just fills an array with 0:
#include <cstdint>
#include <cstdlib>
#include <emmintrin.h>

int main(int argc, char **argv)
{
    typedef int CounterType;          // toggled between int and int64_t
    //typedef int64_t CounterType;
    CounterType N = atoi(argv[1]);
    uint8_t volatile dummy[N + 16];
    __m128i v = _mm_set1_epi8(0);
    for (int j = 0; j < 1000000; ++j)
    {
#pragma nounroll
        for (CounterType i = 0; i <= N; i += CounterType(16))
        {
            _mm_storeu_si128((__m128i *)&dummy[i], v);
        }
    }
    return 0;
}
By using different loop counter types (CounterType) and different compilers, I've recorded the assembly code of the inner loop and performance using hardware performance counters ("perf stat a.out 32768"). I'm running on a Xeon 5670.
GCC4.9, int
.L3:
movups %xmm0, (%rax)
addq $16, %rax
movl %eax, %edx
subl %esi, %edx
cmpl %ecx, %edx
jle .L3
4,127,525,521 cycles # 2.934 GHz
12,304,723,292 instructions # 2.98 insns per cycle
GCC4.9, int64
.L7:
movups %xmm0, (%rcx,%rax)
addq $16, %rax
cmpq %rax, %rdx
jge .L7
4,123,315,191 cycles # 2.934 GHz
8,206,745,195 instructions # 1.99 insns per cycle
ICC11, int64
..B1.6:
movdqu %xmm0, (%rdx,%rdi)
addq $16, %rdx
incq %rcx
cmpq %rbx, %rcx
jb ..B1.6 # Prob 82% #24.5
2,069,719,166 cycles # 2.934 GHz
5,130,061,268 instructions
(faster because of micro-op fusion?)
ICC11, int
..B1.6: # Preds ..B1.4 ..B1.6
movdqu %xmm0, (%rdx,%rbx) #29.38
addq $16, %rdx #24.37
cmpq %rsi, %rdx #24.34
jle ..B1.6 # Prob 82% #24.34
4,136,109,529 cycles # 2.934 GHz
8,206,897,268 instructions
ICC13, int & int64
movdqu %xmm0, (%rdi,%rax) #29.38
addq $16, %rdi #24.37
cmpq %rsi, %rdi #24.34
jle ..B1.7
4,123,963,321 cycles # 2.934 GHz
8,206,083,789 instructions # 1.99 insns per cycle
The data seems to suggest that int64 is faster. Maybe that's because it matches the pointer size and therefore avoids any conversions, but I'm not convinced of that conclusion. Another possibility is that in some cases the compiler decided to do the loop comparison before the store to achieve more parallelism, at the cost of one extra instruction (because x86's two-operand instructions are destructive). But that would be incidental, not fundamentally caused by the loop variable type.
Can someone explain this mystery (preferably someone knowledgeable about compiler transformations)?
There is also a claim in the CUDA C Best Practices Guide that signed loop counters are simpler than unsigned ones to generate code for. That doesn't seem relevant here, since there are no multiplies in the inner loop for address calculation: that expression gets turned into an induction variable. Apparently CUDA prefers multiply-add to compute addresses anyway, since MAD is one instruction just like add, and it can cut register use by one.
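To make the signed/unsigned point concrete, here are the two counter flavours side by side (fill_signed/fill_unsigned are just illustrative names, not from my real code):
#include <stdint.h>

void fill_signed(uint8_t *p, int n)
{
    // Signed overflow is undefined, so the compiler may assume i never
    // wraps and promote it to a 64-bit induction variable for addressing.
    for (int i = 0; i < n; ++i)
        p[i] = 0;
}

void fill_unsigned(uint8_t *p, uint32_t n)
{
    // Unsigned arithmetic must wrap mod 2^32, which can force the compiler
    // to keep i in 32 bits and zero-extend it in each address calculation.
    for (uint32_t i = 0; i < n; ++i)
        p[i] = 0;
}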
Upvotes: 5
Views: 294
Reputation: 33709
Yes, the loop variable type can affect efficiency. Let me suggest an even better solution (with GCC):
#include <stdint.h>
#include <stddef.h>
#include <xmmintrin.h>

void distance(uint8_t* dummy, size_t n, const __m128 v0)
{
    intptr_t i;
    // Run a negative index up to zero: the index increment sets the
    // flags the loop branch tests, so no separate cmp is needed.
    for (i = -(intptr_t)n; i < 0; i += 4) {
        _mm_store_ps(&((float*)dummy)[i + n], v0);
    }
}
With GCC 4.9.2 and GCC 5.3, this produces this main loop:
.L5:
vmovaps %xmm0, (%rdi,%rax)
addq $16, %rax
js .L5
Clang 3.6, however, still generates the cmp:
.LBB0_2: # =>This Inner Loop Header:
vmovaps %xmm0, 8(%rdi,%rax)
addq $4, %rax
cmpq $-4, %rax
jl .LBB0_2
Clang 3.7 unrolls four times and still uses a cmp. ICC 13 unrolls twice and uses a cmp. So only GCC manages to do this without the unnecessary cmp instruction.
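Applied back to the question's byte-fill loop, the same idiom would look like this (an untested sketch; fill_zero and the multiple-of-16 assumption are mine, and whether the cmp actually disappears still depends on the compiler):
#include <stdint.h>
#include <emmintrin.h>

// Index from -n up to 0 so the loop branch can reuse the flags
// set by the index increment instead of a separate cmp.
void fill_zero(uint8_t *dummy, intptr_t n)   // n assumed a multiple of 16
{
    const __m128i v = _mm_set1_epi8(0);
    uint8_t *end = dummy + n;
    for (intptr_t i = -n; i < 0; i += 16)
        _mm_storeu_si128((__m128i *)(end + i), v);
}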
Upvotes: 2
Reputation: 366016
gcc 4.9.2 does a really poor job of compiling the version with an int loop counter. gcc 5.1 and later (on godbolt) make sane loops:
call strtol
mov edx, eax
...
xor eax, eax
.L7:
movups XMMWORD PTR [rcx+rax], xmm0
add rax, 16
cmp edx, eax
jge .L7 ; while(N >= (idx+=16))
This can run at one iteration per cycle on Intel CPUs (barring L1 cache-miss bottlenecks), even if the store doesn't micro-fuse, since cmp/jge macro-fuses into a single uop.
I'm not sure why gcc 4.9.2 makes such a dumb loop. It decides it wants to increment a pointer, but then subtracts the start address every time to compare against N, instead of computing an end address and using it as the loop condition. It computes i from its pointer into the array using 32-bit operations, which is actually safe since gcc only wants a 32-bit result: the upper 32 bits of the inputs wouldn't affect the low 32 bits of the result even if gcc had done the math in 64 bits.
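In C terms, the difference between the two strategies looks roughly like this (my reading of the codegen, not actual compiler output; the function names are made up):
#include <stdint.h>
#include <emmintrin.h>

// gcc 4.9.2 style: re-derive i from the pointer every iteration
// (a sub + cmp inside the loop, as in the .L3 loop in the question).
void fill_gcc49(uint8_t *dummy, int N, __m128i v)
{
    uint8_t *p = dummy;
    do {
        _mm_storeu_si128((__m128i *)p, v);
        p += 16;
    } while ((int)(p - dummy) <= N);
}

// Sane style: hoist the end address, leaving one macro-fusable cmp/branch.
void fill_sane(uint8_t *dummy, int N, __m128i v)
{
    uint8_t *p = dummy, *end = dummy + N;
    do {
        _mm_storeu_si128((__m128i *)p, v);
        p += 16;
    } while (p <= end);
}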
Upvotes: 1
Reputation: 172
As far as I'm aware, loop types do not interfere with performance and execution speed; for optimization, the only things that count are the number of elements you iterate over and the number of commands inside the loop. For filling the 2D array with numbers, your execution complexity will be
(number of elements) * (number of commands within loop)
Each line in your loop counts as +1 to the number of commands.
That's optimization from a programming point of view; the only other thing that makes it faster is having a better processor which can execute more commands per second, but that's up to the user.
EDIT:
Note that in some cases it's faster to use a pointer to the array and fill the elements within a single loop rather than having two loops. C allows a lot of variations of the same algorithm.
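A minimal sketch of that pointer idea (fill2d is just an illustrative name, and the array is assumed contiguous):
// Treat a contiguous 2D array as one flat block and fill it with a
// single pointer loop instead of nested row/column loops.
void fill2d(int rows, int cols, int a[rows][cols])
{
    int *p = &a[0][0];
    for (int i = 0; i < rows * cols; ++i)
        p[i] = 0;
}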
Upvotes: -1