Reputation: 15532
While reading the sources of the GNU C++ standard library, I found some code for copying (or moving, if possible) a range of iterators (file stl_algobase.h), which uses template specialization for some optimizations. A comment corresponding to it says:
All of these auxiliary structs serve two purposes. (1) Replace calls to copy with memmove whenever possible. (Memmove, not memcpy, because the input and output ranges are permitted to overlap.) (2) If we're using random access iterators, then write the loop as a for loop with an explicit count.
The specialization using the second optimization looks like this:
template<>
  struct __copy_move<false, false, random_access_iterator_tag>
  {
    template<typename _II, typename _OI>
      static _OI
      __copy_m(_II __first, _II __last, _OI __result)
      {
        typedef typename iterator_traits<_II>::difference_type _Distance;
        for (_Distance __n = __last - __first; __n > 0; --__n)
          {
            *__result = *__first;
            ++__first;
            ++__result;
          }
        return __result;
      }
  };
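For comparison, the sibling specialization implementing the first optimization, which dispatches to memmove for trivially copyable types accessed through plain pointers, looks roughly like this (a simplified paraphrase, not the verbatim code from stl_algobase.h):

template<bool _IsMove>
  struct __copy_move<_IsMove, true, random_access_iterator_tag>
  {
    template<typename _Tp>
      static _Tp*
      __copy_m(const _Tp* __first, const _Tp* __last, _Tp* __result)
      {
        // Pointer difference gives the element count directly.
        const ptrdiff_t _Num = __last - __first;
        if (_Num)
          __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
        return __result + _Num;
      }
  };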
So, I have two questions concerning this:

1. How can memmove increase the speed of copying? Is it implemented somehow more effectively than a simple loop?
2. How can the explicit count in the for loop affect the performance?

Some clarification: I would like to see some optimization examples actually used by compilers, not elaboration on the possibility of those.
Edit: the first question is quite nicely answered here.
Upvotes: 1
Views: 102
Reputation: 1584
Answering the second question: the explicit count does indeed lead to more opportunities for loop unrolling, though even with pointers iterating through a fixed-size array, gcc does not perform aggressive unrolling unless asked to do so with -funroll-loops. The other gain comes from a potentially simpler end-of-loop comparison test for non-trivial iterators.

On a Core i7-4770, I benchmarked the time spent copying a maximally-aligned 2048-element integer array with both a while-loop implementation and an explicit-count implementation. (Times are in microseconds and include call overhead; minimum of 200 samples of a timing loop with warm-up.)
                                       while    count
gcc -O3                                0.179    0.178
gcc -O3 -march=native                  0.097    0.095
gcc -O3 -march=native -funroll-loops   0.066    0.066
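The two copy routines benchmarked were along the following lines (a reconstruction based on the function signatures visible in the assembly below, not the author's exact benchmark code):

// Reconstruction (assumed): copy guarded by an end-of-range comparison.
void array_copy_while(int (&src)[2048], int (&dst)[2048])
{
    const int* first = src;
    const int* last  = src + 2048;
    int* result = dst;
    while (first != last)          // compare iterators on every iteration
        *result++ = *first++;
}

// Reconstruction (assumed): copy driven by an explicit element count,
// mirroring the __copy_m specialization quoted in the question.
void array_copy_count(int (&src)[2048], int (&dst)[2048])
{
    const int* first = src;
    int* result = dst;
    for (long n = 2048; n > 0; --n)
    {
        *result = *first;
        ++first;
        ++result;
    }
}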
In each case, the generated code is very similar; the while version does a bit more work at the end, handling checks that there aren't any entries left to copy that didn't fill out a whole 128-bit (SSE) or 256-bit (AVX) register, but these are pretty much taken care of by the branch predictor. The gcc -O3 assembly for each is as follows (leaving out assembler directives).

while version:
array_copy_while(int (&) [2048], int (&) [2048]):
leaq 8192(%rdi), %rax
leaq 4(%rdi), %rdx
movq %rax, %rcx
subq %rdx, %rcx
movq %rcx, %rdx
shrq $2, %rdx
leaq 1(%rdx), %r8
cmpq $8, %r8
jbe .L11
leaq 16(%rsi), %rdx
cmpq %rdx, %rdi
leaq 16(%rdi), %rdx
setae %cl
cmpq %rdx, %rsi
setae %dl
orb %dl, %cl
je .L11
movq %r8, %r9
xorl %edx, %edx
xorl %ecx, %ecx
shrq $2, %r9
leaq 0(,%r9,4), %r10
.L9:
movdqa (%rdi,%rdx), %xmm0
addq $1, %rcx
movdqa %xmm0, (%rsi,%rdx)
addq $16, %rdx
cmpq %rcx, %r9
ja .L9
leaq 0(,%r10,4), %rdx
addq %rdx, %rdi
addq %rdx, %rsi
cmpq %r10, %r8
je .L1
movl (%rdi), %edx
movl %edx, (%rsi)
leaq 4(%rdi), %rdx
cmpq %rdx, %rax
je .L1
movl 4(%rdi), %edx
movl %edx, 4(%rsi)
leaq 8(%rdi), %rdx
cmpq %rdx, %rax
je .L20
movl 8(%rdi), %eax
movl %eax, 8(%rsi)
ret
.L11:
movl (%rdi), %edx
addq $4, %rdi
addq $4, %rsi
movl %edx, -4(%rsi)
cmpq %rdi, %rax
jne .L11
.L1:
rep ret
.L20:
rep ret
count version:
array_copy_count(int (&) [2048], int (&) [2048]):
leaq 16(%rsi), %rax
movl $2048, %ecx
cmpq %rax, %rdi
leaq 16(%rdi), %rax
setae %dl
cmpq %rax, %rsi
setae %al
orb %al, %dl
je .L23
movw $512, %cx
xorl %eax, %eax
xorl %edx, %edx
.L29:
movdqa (%rdi,%rax), %xmm0
addq $1, %rdx
movdqa %xmm0, (%rsi,%rax)
addq $16, %rax
cmpq %rdx, %rcx
ja .L29
rep ret
.L23:
xorl %eax, %eax
.L31:
movl (%rdi,%rax,4), %edx
movl %edx, (%rsi,%rax,4)
addq $1, %rax
cmpq %rax, %rcx
jne .L31
rep ret
When the iterators are more complicated however, the difference becomes more pronounced. Consider a hypothetical container that stores values in a series of fixed-sized allocated buffers. An iterator comprises a pointer to the chain of blocks, a block index and a block offset. Comparison of two iterators requires potentially two comparisons. Incrementing the iterator requires checking if we pop over a block boundary.
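A minimal sketch of what such an iterator might look like (a hypothetical illustration, not the actual container used in the benchmark):

#include <cstddef>

// Hypothetical chunked-container iterator: values live in fixed-size blocks
// reachable through a chain (here, an array of block pointers).
struct chunk_iterator
{
    int** blocks;          // pointer to the chain of blocks
    std::size_t block;     // index of the current block
    std::size_t offset;    // offset within the current block
    static const std::size_t block_size = 512;

    int& operator*() const { return blocks[block][offset]; }

    chunk_iterator& operator++()
    {
        // Incrementing must check whether we step over a block boundary.
        if (++offset == block_size)
        {
            offset = 0;
            ++block;
        }
        return *this;
    }

    bool operator!=(const chunk_iterator& other) const
    {
        // Comparison may need two tests: block index and offset.
        return block != other.block || offset != other.offset;
    }
};

With an explicit count, the copy loop only has to decrement and test a single integer instead of performing this two-part comparison on every iteration.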
I made such a container, and performed the same benchmark for copying a 2000-long container of int, with a block size of 512 ints.
                                       while    count
gcc -O3                                1.560    2.818
gcc -O3 -march=native                  1.660    2.854
gcc -O3 -march=native -funroll-loops   1.432    2.858
That looks weird! Oh wait, it's because gcc 4.8 has a misoptimisation, where it uses conditional moves instead of nice, branch-predictor-friendly comparisons (gcc bug 56309).
Let's try icc on a different machine (Xeon E5-2670).
                  while    count
icc -O3           3.952    3.704
icc -O3 -xHost    3.898    3.624
This is closer to what we'd expect: a small but significant improvement from the simpler loop condition. On a different architecture, the gain is more pronounced. Here is clang targeting a PowerPC A2 at 1.6 GHz:
                  while     count
bgclang -O3       36.528    31.623
I'll omit the assembly, as it's quite long!
Upvotes: 1