Reputation: 179
I have a row-major iterator to a 2D array, with the dereference operator as follows:
int& Iterator::operator*(){ return matrix_[y_][x_]; } //matrix_ has type int**
The (prefix) increment operator is as follows:
Iterator& Iterator::operator++()
{
    if((++x_ == xs_) && (++y_ != ys_)) //xs_, ys_ are the dimensions of the matrix; && short-circuits, so y_ only advances at the end of a row
        x_ = 0;
    return *this;
}
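For context, here is a minimal reconstruction of what the iterator's state might look like; the members x_, y_, xs_, ys_ and matrix_ appear in the snippets above, but the constructor, the end-iterator convention, and operator!= are my assumptions:

```cpp
#include <cassert>

// Hypothetical reconstruction of the iterator; only the member names and the
// operator bodies come from the question, the rest is assumed.
struct Iterator
{
    Iterator(int** matrix, int xs, int ys, int x = 0, int y = 0)
        : matrix_(matrix), xs_(xs), ys_(ys), x_(x), y_(y) {}

    int& operator*() { return matrix_[y_][x_]; }

    Iterator& operator++()
    {
        if ((++x_ == xs_) && (++y_ != ys_)) // y_ only advances at a row's end
            x_ = 0;
        return *this;
    }

    bool operator!=(const Iterator& rhs) const
    { return x_ != rhs.x_ || y_ != rhs.y_; }

    int** matrix_;
    int xs_, ys_;
    int x_, y_; // current position
};
```

With this convention the end iterator would sit at (xs_, ys_), which is the state the increment operator leaves behind after the last element.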
I can use this iterator with an optimised version of std::transform (it doesn't return an unneeded result, which saves a few instructions):
template < class InputIterator, class OutputIterator, class UnaryOperator >
inline void MyTransform( InputIterator first1, InputIterator last1,
OutputIterator result, UnaryOperator op )
{
    for (; first1 != last1; ++first1, ++result)
        *result = op(*first1);
}
calling it thus:
MyTransform(matrix1.begin(), matrix1.end(), matrix2.begin(), MyFunctor());
However, when I compare the performance to a classic, nested for-loop:
MyFunctor f;
for (int y = 0; y < ySize; y++)
    for (int x = 0; x < xSize; x++)
        matrix2[y][x] = f(matrix1[y][x]);
The iterator-based solution is approx. 25% slower than the nested for-loop solution. This is the case with both the MSVC and Intel C++ compilers (both of which appear to auto-inline as needed).
Now, the problem does not seem to lie in the iterator's increment operator: if I use the following (ugly) hybrid that combines iterator traversal with raw array access (the latter indexed using the iterators' internal counters):
MyFunctor f;
for (; mat1Begin != mat1End; ++mat1Begin, ++mat2Begin)
{
    //mat1 and mat2 are of type int**
    mat2[mat2Begin.y_][mat2Begin.x_] = f(mat1[mat1Begin.y_][mat1Begin.x_]);
}
it is actually a little faster than the nested for-loop solution. This suggests to me that the performance hit comes from dereferencing the iterators during the assignment.
My question is, why does dereferencing the iterators in the assignment
*result = op(*first1);
incur such a massive performance hit, relative to raw array access? Is there any technique I can use for this simple design, to get performance (almost) equivalent to the raw array version?
In response to helpful feedback from this community, I have modified the code so that the pointer to the current row is cached, so the code now looks as follows:
int& Iterator::operator*()
{
    return column_[x_];
}
Iterator& Iterator::operator++()
{
    if(++x_ == xs_) //xs_, ys_ are the dimensions of the matrix
    {
        if(++y_ != ys_)
        {
            x_ = 0;
            column_ = matrix_[y_];
        }
    }
    return *this;
}
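For completeness, the cached row pointer also has to be initialized somewhere; since the constructor isn't shown, here is a hypothetical sketch (the member names match the operators above, everything else is assumed):

```cpp
#include <cassert>

// Hypothetical sketch of the cached-column iterator's setup. The operator
// bodies mirror those in the question; the constructor is an assumption.
class Iterator
{
public:
    Iterator(int** matrix, int xs, int ys)
        : matrix_(matrix), xs_(xs), ys_(ys), x_(0), y_(0),
          column_(matrix[0]) // cache the first row so operator* is one indirection
    {}

    int& operator*() { return column_[x_]; }

    Iterator& operator++()
    {
        if (++x_ == xs_)
        {
            if (++y_ != ys_)
            {
                x_ = 0;
                column_ = matrix_[y_]; // re-cache only on row change
            }
        }
        return *this;
    }

private:
    int** matrix_;
    int xs_, ys_;
    int x_, y_;
    int* column_; // pointer to the current row, despite the name
};
```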
This improves performance to ~85% of the raw 2D array's for the Intel C++ compiler, and similarly for MSVC (the call to MyTransform is actually slower on MSVC - many more assembly instructions are generated - but let's ignore that for now, as I am more interested in the loop/dereference behaviour).
When I convert the code to pointer arithmetic (again caching the current row) as follows, performance is significantly worse than the raw 2D array (~70%) with the Intel compiler, but again ~85% of the raw 2D array under MSVC:
int& Image32Iterator::operator*()
{
    return *ptr_;
}
//prefix
Image32Iterator& Image32Iterator::operator++()
{
    if(++ptr_ == ptrEnd_)
    {
        if(++row_ != rowEnd_)
        {
            ptrEnd_ = (ptr_ = *row_) + xs_;
        }
    }
    return *this;
}
So I am trying to understand whether ~85% is the best I can get from an iterator-based solution. I am surprised that the pointer-arithmetic version performs so much worse (galling, as I was trying pointer arithmetic in the hope of getting above 85%!).
I'll continue investigations and update with finds, but any insight welcome...
...So, focussing on why the pointer-arithmetic version of the iterator performs so badly under Intel while it performs okay under MSVC, I have looked at the assembly, and the problem seems to be in the code generated for the loop. For all the other functions (i.e., the constructors, the increment and dereference operators, the inequality operator, etc.) the generated code is pretty much the same for both Intel and MSVC; if anything it is slightly more concise from Intel.
Here is the assembler for the Intel generated code, followed by the assembler for the MSVC generated code. I have changed from a for loop to a while loop to make the generated assembler easier to read:
Intel generated code:
while(begin != end)
01392D31 push eax
01392D32 lea eax,[begin]
01392D35 lea edx,[end]
01392D38 mov dword ptr [esp],edx
01392D3B mov ecx,eax
01392D3D call ImageExperiments::Image32Iterator::operator!= (139103Ch)
01392D42 mov byte ptr [ebp-74h],al
01392D45 movzx eax,byte ptr [ebp-74h]
01392D49 movzx eax,al
01392D4C test eax,eax
01392D4E je ImageExperiments::greyscale_iterator2+0BCh (1392DACh)
{
*it8 = gsf(*begin);
01392D50 lea eax,[begin]
01392D53 mov ecx,eax
01392D55 call ImageExperiments::Image32Iterator::operator* (13910A5h)
01392D5A mov dword ptr [ebp-10h],eax
01392D5D push eax
01392D5E lea eax,[gsf]
01392D61 mov edx,dword ptr [ebp-10h]
01392D64 mov edx,dword ptr [edx]
01392D66 mov dword ptr [esp],edx
01392D69 mov ecx,eax
01392D6B call ImageExperiments::GreyScaleFunctor::operator() (139101Eh)
01392D70 mov byte ptr [ebp-72h],al
01392D73 movzx eax,byte ptr [ebp-72h]
01392D77 mov byte ptr [ebp-71h],al
01392D7A lea eax,[it8]
01392D7D mov ecx,eax
01392D7F call ImageExperiments::Image8Iterator::operator* (1391050h)
01392D84 mov dword ptr [ebp-0Ch],eax
01392D87 mov eax,dword ptr [ebp-0Ch]
01392D8A movzx edx,byte ptr [ebp-71h]
01392D8E mov byte ptr [eax],dl
++begin;
01392D90 lea eax,[begin]
01392D93 mov ecx,eax
01392D95 call ImageExperiments::Image32Iterator::operator++ (1391028h)
01392D9A mov dword ptr [ebp-8],eax
++it8;
01392D9D lea eax,[it8]
01392DA0 mov ecx,eax
01392DA2 call ImageExperiments::Image8Iterator::operator++ (1391014h)
01392DA7 mov dword ptr [ebp-4],eax
01392DAA jmp ImageExperiments::greyscale_iterator2+41h (1392D31h)
}
}
00CA2DAC leave
00CA2DAD ret
MSVC generated code:
while(begin != end)
010316E0 lea eax,[end]
010316E3 push eax
010316E4 lea ecx,[begin]
010316E7 call ImageExperiments::Image32Iterator::operator!= (1031096h)
010316EC movzx ecx,al
010316EF test ecx,ecx
010316F1 je ImageExperiments::greyscale_iterator2+74h (1031724h)
{
*it8 = gsf(*begin);
010316F3 lea ecx,[begin]
010316F6 call ImageExperiments::Image32Iterator::operator* (10311EAh)
010316FB mov eax,dword ptr [eax]
010316FD push eax
010316FE lea ecx,[gsf]
01031701 call ImageExperiments::GreyScaleFunctor::operator() (1031032h)
01031706 mov bl,al
01031708 lea ecx,[it8]
0103170B call ImageExperiments::Image8Iterator::operator* (1031118h)
01031710 mov byte ptr [eax],bl
++begin;
01031712 lea ecx,[begin]
01031715 call ImageExperiments::Image32Iterator::operator++ (1031041h)
++it8;
0103171A lea ecx,[it8]
0103171D call ImageExperiments::Image8Iterator::operator++ (103101Eh)
}
01031722 jmp ImageExperiments::greyscale_iterator2+30h (10316E0h)
}
01031724 pop edi
01031725 pop esi
01031726 pop ebx
01031727 mov esp,ebp
01031729 pop ebp
0103172A ret
So it looks to me as though the Intel compiler generates approx. 50% more instructions. I have tried qualifying the pointers with __restrict to see whether that would make any difference to the Intel code generation, but it did not. If anyone has any suggestion as to why the loop code from the Intel compiler is so bulky/slow compared to the MSVC++ compiler's, I'd be very interested!
Upvotes: 2
Views: 1848
Reputation: 6740
I've had a go at recreating your code, see here.
Running it under g++ (4.6.3, -O3), I find that:
1) The non-iterator version is indeed faster, but in my case by about a factor of 4.
2) The iterator version, whether you then dereference the iterators or extract their counters and use them to access the array directly, is slower (by the aforementioned factor).
I've captured the assembler for the two versions and find them to be quite different, with a significant amount of code associated with the iterator-incrementing logic in version 2. Note that everything is inlined in both cases.
Case 1 inner loop, no iterators:
.L18:
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L19:
movq 24(%rsp), %rdx
movq 40(%rsp), %rsi
movq (%rdx,%rcx), %rdx
movq (%rsi,%rcx), %rsi
movl (%rdx,%rax), %edx
imull %edx, %edx
movl %edx, (%rsi,%rax)
addq $4, %rax
cmpq $20000, %rax
jne .L19
addq $8, %rcx
cmpq $40000, %rcx
jne .L18
movl $.LC2, %esi
movl std::cout, %edi
Case 2 inner loop, iterators:
.L34:
movl %eax, 56(%rsp)
movl %ecx, 60(%rsp)
movl %edi, 72(%rsp)
movl %edi, 76(%rsp)
movq 72(%rsp), %rdi
cmpq %rdi, 56(%rsp)
je .L36
movq 24(%rsp), %rdi
movslq %eax, %r10
movslq %ecx, %r9
movslq %edx, %r11
addl $1, %eax
movq (%rdi,%r10,8), %rdi
movslq %esi, %r10
movl (%rdi,%r9,4), %edi
movq 40(%rsp), %r9
imull %edi, %edi
movq (%r9,%r11,8), %r9
movl %edi, (%r9,%r10,4)
movl 16(%rsp), %edi
cmpl %edi, %eax
je .L37
.L20:
addl $1, %edx
cmpl 32(%rsp), %edx
jne .L34
addl $1, %esi
cmpl %esi, %edx
cmovne %r8d, %edx
jmp .L34
.L37:
addl $1, %ecx
cmpl %ecx, %eax
cmovne %r8d, %eax
jmp .L20
.L36:
Ultimately, I think the best advice, if you like the iterator pattern, is to redefine your matrix's internal array as a flat int*, allowing the iterator to be a simple wrapper around a pointer. This is obviously at the expense of random-access indexing into the matrix, which will now need to compute the 1-D offset into the int array from x, y and the row width (although that's hardly rocket science!).
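As a sketch of that suggestion (the class and member names here are illustrative, not from the original code): store the matrix in one contiguous int array, hand out plain pointers as iterators, and pay for random access with the 1-D offset calculation:

```cpp
#include <cassert>
#include <vector>

// Illustrative flat-storage matrix; all names are made up for this sketch.
struct FlatMatrix
{
    FlatMatrix(int xs, int ys) : xs_(xs), ys_(ys), data_(xs * ys) {}

    // Iterators are plain pointers into contiguous storage, so the
    // compiler sees an ordinary linear loop.
    int* begin() { return data_.data(); }
    int* end()   { return data_.data() + data_.size(); }

    // Random access now computes the 1-D offset from x, y and row width.
    int& at(int x, int y) { return data_[y * xs_ + x]; }

    int xs_, ys_;
    std::vector<int> data_;
};
```

Because begin()/end() are raw pointers, a loop over them is exactly the kind of code both compilers already optimize well.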
Upvotes: 2
Reputation: 1756
I think your iterator is too big. When you call operator*(), the worst case is that your compiler needs to fetch y_ and x_ before it can fetch the value of matrix_ at x_, y_. I would try to use raw pointers as iterators when possible. That means that when matrix_ is defined as int matrix_[N][M], you can use &matrix_[0][0] as begin and &matrix_[N-1][M] as end for the iteration. And of course, there's always valarray.
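The raw-pointer idea looks like this in practice (a minimal sketch; the functor and dimensions are made up, but a statically sized 2-D array is contiguous, so plain int* works as the iterator):

```cpp
#include <algorithm>
#include <cassert>

const int N = 2, M = 3; // dimensions chosen for the sketch

int square(int v) { return v * v; } // stand-in for the question's functor

void transform_matrix(int (&src)[N][M], int (&dst)[N][M])
{
    // &src[N-1][M] is one past the last element of the last row, i.e. the
    // end of the whole contiguous block, as suggested above.
    std::transform(&src[0][0], &src[N - 1][M], &dst[0][0], square);
}
```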
Upvotes: 1
Reputation: 157374
You're hoisting the double indirection matrix_[y_][x_] from a function call into the loop. Possibly the compiler manages to cache the pointer matrix_[y_] in one case but not the other; could you try caching matrix_[y_] in the iterator?
Upvotes: 0
Reputation: 5262
1. Memory localization. Is the storage guaranteed contiguous? I noticed you clarified that mat1 and mat2 are int**, but how is matrix_ handled in memory? Iterators can conceivably point anywhere. Is your memory for matrix_ localized? Heap-based multidimensional arrays may not be contiguous, but std::vector<> is.
This line of code isn't using the actual iterators; it uses their internal counters to index arrays that are localized:
mat2[mat2Begin.y_][mat2Begin.x_] = f(mat1[mat1Begin.y_][mat1Begin.x_]);
2. You may be blocking optimizations. In the second usage of the increment operator, you're doing that step before calling the functor.
This may mean that calling the functor with an object dereferenced through the operator interferes with the optimizer's ability to order operations.
Try storing off the dereferenced object before calling op() on it, and see if that eliminates the cost.
for (; first1 != last1; ++first1, ++result)
{
    typename std::iterator_traits<InputIterator>::value_type val = *first1;
    *result = op(val);
}
I've seen some funky things happen when operators are used in argument expressions, to the point where resolution of the expression is deferred until after the call (some other interpretation of the expression is sent in and resolved afterwards), and argument evaluation order is not guaranteed. If you have efficiency problems, it's best to pass the actual intended object as the argument.
Upvotes: 0