Reputation: 969
I have two versions of searching through an int array for a specific value.
The first version is the straight forward one
int FindWithoutBlock(int * Arr, int ArrLen, int Val)
{
for ( int i = 0; i < ArrLen; i++ )
if ( Arr[i] == Val )
return i;
return ArrLen;
}
The second version should be faster. The passed array needs to be one element larger than in the previous case. Say for an array with 5 values, you allocate six ints and then do the following
int FindWithBlock(int * Arr, int LastCellIndex, int Val)
{
Arr[LastCellIndex] = Val;
int i;
for ( i = 0 ; Arr[i] != Val; i++ );
return i;
}
this version should be faster - you don't need to check array boundaries with each iteration through Arr.
Now the "issue". When running these functions 100K times on an array of 100K elements in Debug, the second version is roughly 2x faster. In Release however, the first version is approximately 6000x faster. And the question is why.
A program that demonstrates this is to be found at http://eubie.sweb.cz/main.cpp
Any insight is much appreciated. Daniel
Upvotes: 8
Views: 438
Reputation: 33136
This is more of an extended comment than an answer. Skizz already answered the question with "Aha! The FindWithoutBlock function is only being called once!"
Test driver
I typically tend to put the code for the test driver and the test article in separate files. For one thing, you aren't going to deliver the test driver. For another, combining them like you did lets the optimizer do things you really do not want to be done such as calling the function once rather than 100,000 times. Separating them lets you use different optimization levels for the driver and test article. I tend to compile the driver unoptimized so that the loop that does the same thing 100K times truly is executed 100K times. The test article on the other hand is compiled with the optimization expected for the release.
Use of getchar()
It's usually a bad idea to use any I/O inside the test loop when testing for CPU utilization. Your test code is calling getchar when the item to be found is not in the array. [Rest of faulty analysis elided.] Update: Your test code calls getchar
when the item to be found is in the array. Even though your test code ensures the item will not be found (and hence getchar
won't be called) it's still not a good idea to have that call. Do something fast and benign instead.
C versus C++
Your code looks more like C± rather than C++. You are using malloc
rather than new
, you are intermingling C and C++ I/O, and you aren't using the C++ library such as std::find
. This is typical for someone moving from C to C++. It's good to be aware of things like std::find
. This allows you to completely eliminate your FindWithoutBlock
function.
Premature optimization
The only reason to use that FindWithBlock
formulation is because this search is a bottleneck. So is this truly a bottleneck? The FindWithoutBlock
formulation (and even better, std::find
) is arguably a better way to go because you do not need to modify the array and hence the array argument can be marked as const
. The array cannot be marked as such with FindWithBlock
because you are modifying the array.
Upvotes: 2
Reputation: 300429
Your compiler is smart.
If you use the LLVM Try Out page, you will obtain the following IR generated:
define i32 @FindWithoutBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable readonly
define i32 @FindWithBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable
The only difference is the presence of the readonly
attribute on the first function:
From the Language Reference page:
readonly
This attribute indicates that the function does not write through any pointer arguments (including byval arguments) or otherwise modify any state (e.g. memory, control registers, etc) visible to caller functions. It may dereference pointer arguments and read state that may be set in the caller. A readonly function always returns the same value (or unwinds an exception identically) when called with the same set of arguments and global state. It cannot unwind an exception by calling the C++ exception throwing methods.
It means that, potentially, the optimizer may realise that the function will always return the same computation (for a given loop) and hoist it outside the loop.
Upvotes: 0
Reputation: 71130
Here are my results using DevStudio 2005:
Debug:
Release:
It is very important to run this from the command line and not from within DevStudio, DevStudio does something to affect the performance of the app.
The only way to know what's really happening is to look at the assembler code. Here's the assembler generated in release:-
FindWithoutBlock:
00401000 xor eax,eax
00401002 cmp dword ptr [ecx+eax*4],0F4240h
00401009 je FindWithoutBlock+1Ah (40101Ah)
0040100B add eax,1
0040100E cmp eax,186A0h
00401013 jl FindWithoutBlock+2 (401002h)
00401015 mov eax,186A0h
0040101A ret
Note that the compiler has removed the ArrLen parameter and replaced it with a constant! It has also kept it as a function.
Here's what the compiler did with the other function (FindWithBlock):-
004010E0 mov dword ptr [esp+38h],186A0h
004010E8 mov ebx,0F4240h
004010ED mov dword ptr [esi+61A80h],ebx
004010F3 xor eax,eax
004010F5 cmp dword ptr [esi],ebx
004010F7 je main+0EFh (40110Fh)
004010F9 lea esp,[esp]
00401100 add eax,1
00401103 cmp dword ptr [esi+eax*4],ebx
00401106 jne main+0E0h (401100h)
00401108 cmp eax,186A0h
0040110D je main+0F5h (401115h)
0040110F call dword ptr [__imp__getchar (4020D0h)]
00401115 sub dword ptr [esp+38h],1
0040111A jne main+0CDh (4010EDh)
Here, the function has been in-lined. The lea esp,[esp]
is just a 7 byte nop to align the next instruction. The code checks index 0 separately to all the other indices but the main loop is definately tighter than the FindWithoutBlock version.
Hmmm. Here's the code that calls FindWithoutBlock:-
0040106F mov ecx,edi
00401071 mov ebx,eax
00401073 call FindWithoutBlock (401000h)
00401078 mov ebp,eax
0040107A mov edi,186A0h
0040107F cmp ebp,186A0h
00401085 je main+6Dh (40108Dh)
00401087 call dword ptr [__imp__getchar (4020D0h)]
0040108D sub edi,1
00401090 jne main+5Fh (40107Fh)
Aha! The FindWitoutBlock function is only being called once! The compiler has spotted that the function will return the same value every time and has optimised it to a single call. In the FindWithBlock, the compiler can't make the same assumption because you write to the array before the search, thus the array is (potentially) different for each call.
To test this, add the volatile
keyword like this:-
int FindWithoutBlock(volatile int * Arr, int ArrLen, int Val)
{
for ( int i = 0; i < ArrLen; i++ )
if ( Arr[i] == Val )
return i;
return ArrLen;
}
int FindWithBlock(volatile int * Arr, int LastCellIndex, int Val)
{
Arr[LastCellIndex] = Val;
int i;
for ( i = 0 ; Arr[i] != Val; i++ );
return i;
}
Doing this, both versions run in similar time (6.040). Seeing as the memory access is a major bottleneck, the more complex tests of the FindWithoutBlock don't impact on the overall speed.
Upvotes: 8
Reputation: 147056
First, ewwwwww disgusting C garbage. std::find
and iterators?
But secondly, the compiler's optimizer is written to recognize the first form- not the second. It may be, for example, inlined, unrolled, or vectorized, whereas the second cannot be.
In the general case, consider the cache issue. You are touching the end of the array and then going to the beginning- this could be a cache miss. However in the first block you are cheerily going only sequentially through the array- more cache coherent.
Upvotes: 2
Reputation: 1668
In the first example there are two conditions checked at every iteration: i < ArrLen
and Arr[i] == Val
. In the second example there's only one condition to check. That's why the first loop is twice as slower.
I can't observe the same behavior using GCC: the first loop is still slower.
With -O0
:
Without block: 25.83
With block: 20.35
With -O3
:
Without block: 6.33
With block: 4.75
I guess that the compiler somehow deduced that there is no SearchVal
in array and thus there's no reason to call a function which searches for it.
Upvotes: 0
Reputation: 5696
The first for .. loop contains two conditions for each iteration while the second for loop contains one iteration per loop. For a large number of iterations, this difference should show because there is a RAW dependency between second condition and the iterator increment. But I still don't think that the speedup should be so high.
Upvotes: 0
Reputation:
What I observe is that in the first case, the compiler knows at run-time the size of the loop (e.g. < ArrLen). In the second case, the compiler cannot know.
Upvotes: 0