Daniel Bencik
Daniel Bencik

Reputation: 969

Performance issue C++ - searching through an array

I have two versions of searching through an int array for a specific value.

The first version is the straight forward one

int FindWithoutBlock(int * Arr, int ArrLen, int Val)
{
    for ( int i = 0; i < ArrLen; i++ )
        if ( Arr[i] == Val )
          return i;
 return ArrLen;
}

The second version should be faster. The passed array needs to be one element larger than in the previous case. Say for an array with 5 values, you allocate six ints and then do the following

int FindWithBlock(int * Arr, int LastCellIndex, int Val)
{
    Arr[LastCellIndex]  = Val;

    int i;
    for ( i = 0 ; Arr[i] != Val; i++ );
    return i;
}

this version should be faster - you don't need to check array boundaries with each iteration through Arr.

Now the "issue". When running these functions 100K times on an array of 100K elements in Debug, the second version is roughly 2x faster. In Release however, the first version is approximately 6000x faster. And the question is why.

A program that demonstrates this is to be found at http://eubie.sweb.cz/main.cpp

Any insight is much appreciated. Daniel

Upvotes: 8

Views: 438

Answers (7)

David Hammen
David Hammen

Reputation: 33136

This is more of an extended comment than an answer. Skizz already answered the question with "Aha! The FindWithoutBlock function is only being called once!"

Test driver
I typically tend to put the code for the test driver and the test article in separate files. For one thing, you aren't going to deliver the test driver. For another, combining them like you did lets the optimizer do things you really do not want to be done such as calling the function once rather than 100,000 times. Separating them lets you use different optimization levels for the driver and test article. I tend to compile the driver unoptimized so that the loop that does the same thing 100K times truly is executed 100K times. The test article on the other hand is compiled with the optimization expected for the release.

Use of getchar()
It's usually a bad idea to use any I/O inside the test loop when testing for CPU utilization. Your test code is calling getchar when the item to be found is not in the array. [Rest of faulty analysis elided.] Update: Your test code calls getchar when the item to be found is in the array. Even though your test code ensures the item will not be found (and hence getchar won't be called) it's still not a good idea to have that call. Do something fast and benign instead.

C versus C++
Your code looks more like C± rather than C++. You are using malloc rather than new, you are intermingling C and C++ I/O, and you aren't using the C++ library such as std::find. This is typical for someone moving from C to C++. It's good to be aware of things like std::find. This allows you to completely eliminate your FindWithoutBlock function.

Premature optimization
The only reason to use that FindWithBlock formulation is because this search is a bottleneck. So is this truly a bottleneck? The FindWithoutBlock formulation (and even better, std::find) is arguably a better way to go because you do not need to modify the array and hence the array argument can be marked as const. The array cannot be marked as such with FindWithBlock because you are modifying the array.

Upvotes: 2

Matthieu M.
Matthieu M.

Reputation: 300429

Your compiler is smart.

If you use the LLVM Try Out page, you will obtain the following IR generated:

define i32 @FindWithoutBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable readonly

define i32 @FindWithBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable

The only difference is the presence of the readonly attribute on the first function:

From the Language Reference page:

readonly

This attribute indicates that the function does not write through any pointer arguments (including byval arguments) or otherwise modify any state (e.g. memory, control registers, etc) visible to caller functions. It may dereference pointer arguments and read state that may be set in the caller. A readonly function always returns the same value (or unwinds an exception identically) when called with the same set of arguments and global state. It cannot unwind an exception by calling the C++ exception throwing methods.

It means that, potentially, the optimizer may realise that the function will always return the same computation (for a given loop) and hoist it outside the loop.

Upvotes: 0

Skizz
Skizz

Reputation: 71130

Here are my results using DevStudio 2005:

Debug:

  • Without block: 25.109
  • With block: 19.703

Release:

  • Without block: 0
  • With block: 6.046

It is very important to run this from the command line and not from within DevStudio, DevStudio does something to affect the performance of the app.

The only way to know what's really happening is to look at the assembler code. Here's the assembler generated in release:-

FindWithoutBlock:
00401000  xor         eax,eax 
00401002  cmp         dword ptr [ecx+eax*4],0F4240h 
00401009  je          FindWithoutBlock+1Ah (40101Ah) 
0040100B  add         eax,1 
0040100E  cmp         eax,186A0h 
00401013  jl          FindWithoutBlock+2 (401002h) 
00401015  mov         eax,186A0h 
0040101A  ret              

Note that the compiler has removed the ArrLen parameter and replaced it with a constant! It has also kept it as a function.

Here's what the compiler did with the other function (FindWithBlock):-

004010E0  mov         dword ptr [esp+38h],186A0h 
004010E8  mov         ebx,0F4240h 
004010ED  mov         dword ptr [esi+61A80h],ebx 
004010F3  xor         eax,eax 
004010F5  cmp         dword ptr [esi],ebx 
004010F7  je          main+0EFh (40110Fh) 
004010F9  lea         esp,[esp] 
00401100  add         eax,1 
00401103  cmp         dword ptr [esi+eax*4],ebx 
00401106  jne         main+0E0h (401100h) 
00401108  cmp         eax,186A0h 
0040110D  je          main+0F5h (401115h) 
0040110F  call        dword ptr [__imp__getchar (4020D0h)] 
00401115  sub         dword ptr [esp+38h],1 
0040111A  jne         main+0CDh (4010EDh) 

Here, the function has been in-lined. The lea esp,[esp] is just a 7 byte nop to align the next instruction. The code checks index 0 separately to all the other indices but the main loop is definately tighter than the FindWithoutBlock version.

Hmmm. Here's the code that calls FindWithoutBlock:-

0040106F  mov         ecx,edi 
00401071  mov         ebx,eax 
00401073  call        FindWithoutBlock (401000h) 
00401078  mov         ebp,eax 
0040107A  mov         edi,186A0h 
0040107F  cmp         ebp,186A0h 
00401085  je          main+6Dh (40108Dh) 
00401087  call        dword ptr [__imp__getchar (4020D0h)] 
0040108D  sub         edi,1 
00401090  jne         main+5Fh (40107Fh) 

Aha! The FindWitoutBlock function is only being called once! The compiler has spotted that the function will return the same value every time and has optimised it to a single call. In the FindWithBlock, the compiler can't make the same assumption because you write to the array before the search, thus the array is (potentially) different for each call.

To test this, add the volatile keyword like this:-

int FindWithoutBlock(volatile int * Arr, int ArrLen, int Val)
{
    for ( int i = 0; i < ArrLen; i++ )
        if ( Arr[i] == Val )
            return i;

    return ArrLen;
}

int FindWithBlock(volatile int * Arr, int LastCellIndex, int Val)
{
    Arr[LastCellIndex]  = Val;

    int i;
    for ( i = 0 ; Arr[i] != Val; i++ );

    return i;
}

Doing this, both versions run in similar time (6.040). Seeing as the memory access is a major bottleneck, the more complex tests of the FindWithoutBlock don't impact on the overall speed.

Upvotes: 8

Puppy
Puppy

Reputation: 147056

First, ewwwwww disgusting C garbage. std::find and iterators?

But secondly, the compiler's optimizer is written to recognize the first form- not the second. It may be, for example, inlined, unrolled, or vectorized, whereas the second cannot be.

In the general case, consider the cache issue. You are touching the end of the array and then going to the beginning- this could be a cache miss. However in the first block you are cheerily going only sequentially through the array- more cache coherent.

Upvotes: 2

Alex Bakulin
Alex Bakulin

Reputation: 1668

In the first example there are two conditions checked at every iteration: i < ArrLen and Arr[i] == Val. In the second example there's only one condition to check. That's why the first loop is twice as slower.

I can't observe the same behavior using GCC: the first loop is still slower.

With -O0:

Without block: 25.83
With block: 20.35

With -O3:

Without block: 6.33
With block: 4.75

I guess that the compiler somehow deduced that there is no SearchVal in array and thus there's no reason to call a function which searches for it.

Upvotes: 0

prathmesh.kallurkar
prathmesh.kallurkar

Reputation: 5696

The first for .. loop contains two conditions for each iteration while the second for loop contains one iteration per loop. For a large number of iterations, this difference should show because there is a RAW dependency between second condition and the iterator increment. But I still don't think that the speedup should be so high.

Upvotes: 0

user82238
user82238

Reputation:

What I observe is that in the first case, the compiler knows at run-time the size of the loop (e.g. < ArrLen). In the second case, the compiler cannot know.

Upvotes: 0

Related Questions