C++: Why does this speed my code up?

Question

I have the following function

double single_channel_add(int patch_top_left_row, int patch_top_left_col, 
        int image_hash_key, 
        Mat* preloaded_images,
        int* random_values){

    int first_pixel_row = patch_top_left_row + random_values[0];
    int first_pixel_col = patch_top_left_col + random_values[1];
    int second_pixel_row = patch_top_left_row + random_values[2];
    int second_pixel_col = patch_top_left_col + random_values[3];

    int channel = random_values[4];

    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr(second_pixel_row, second_pixel_col);

    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}

Which is called about one and a half million times with different values for patch_top_left_row and patch_top_left_col. This takes about 2 seconds to run, now when I change the calculation of first_pixel_row etc to not use the arguments but hard coded numbers instead (shown below), the thing runs sub second and I don't know why. Is the compiler doing something smart here ( I am using gcc cross compiler)?

double single_channel_add(int patch_top_left_row, int patch_top_left_col, 
        int image_hash_key, 
        Mat* preloaded_images,
        int* random_values){

        int first_pixel_row = 5 + random_values[0];
        int first_pixel_col = 6 + random_values[1];
        int second_pixel_row = 8 + random_values[2];
        int second_pixel_col = 10 + random_values[3];
            int channel = random_values[4];

    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr(second_pixel_row, second_pixel_col);

    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}

EDIT:

I have pasted the assembly from the two versions of the function using arguments: http://pastebin.com/tpCi8c0F using constants: http://pastebin.com/bV0d7QH7

EDIT:

After compiling with -O3 I get the following clock ticks and speeds:

using arguments: 1990000 ticks and 1.99seconds using constants: 330000 ticks and 0.33seconds

EDIT: using argumenst with -03 compilation: http://pastebin.com/fW2HCnHc using constant with -03 compilation: http://pastebin.com/FHs68Agi

Omnifarious · Accepted Answer

On the x86 platform there are instructions that very quickly add small integers to a register. These instructions are the lea (aka 'load effective address') instructions and they are meant for computing address offsets for structures and the like. The small integer being added is actually part of the instruction. Smart compilers know that these instructions are very quick and use them for addition even when addresses are not involved.

I bet if you changed the constants to some random value that was at least 24 bits long that you would see much of the speedup disappear.

Secondly those constants are known values. The compiler can do a lot to arrange for those values to end up in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used) the compiler has no choice but to fetch the number from memory using a stack offset load instruction. That isn't a particularly slow instruction or anything, but with constants the compiler is free to do something much faster than may involve simply fetching the number from the instruction itself. The lea instructions are simply the most extreme example of this.

Edit: Now that you've pasted the assembly things are much clearer

In the non-constant code, here is how the add is done:

addl    -68(%rbp), %eax

This fetches a value from the stack an offset -68(%rpb) and adds it to the %eax% register.

In the constant code, here is how the add is done:

addl    $5, %eax

and if you look at the actual numbers, you see this:

0138 83C005

It's pretty clear that the constant being added is encoded directly into the instruction as a small value. This is going to be much faster to fetch than fetching a value from a stack offset for a number of reasons. First it's smaller. Secondly, it's part of an instruction stream with no branches. So it will be pre-fetched and pipelined with no possibility for cache stalls of any kind.

So while my surmise about the lea instruction wasn't correct, I was still on the right track. The constant version uses a small instruction specifically oriented towards adding a small integer to a register. The non-constant version has to fetch an integer that may be of indeterminate size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds in an additional add to compute the actual address from the offset and stack base address).

Edit 2: Now that you've posted the -O3 results

Well, it's much more confusing now. It's apparently inlined the function in question and it jumps around a whole ton between the code for the inlined function and the code for the calling function. I'm going to need to see the original code for the whole file to make a proper analysis.

But what I strongly suspect is happening now is that the unpredictability of the values retrieved from get_random_number_in_range is severely limiting the optimization options available to the compiler. In fact, it looks like in the constant version it doesn't even bother to call get_random_number_in_range because the value is tossed out and never used.

I'm assuming that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would push this loop into this function. If the compiler knows the values are generated as part of a loop, there are a very large number of optimization options open to it. In the extreme case it could use some of the SIMD instructions that are part of the various SSE or 3dnow! instruction suites to make things a whole ton faster than even the version you have that uses constants.

The other option would be to make this function inline, which would hint to the compiler that it should try inserting it into the loop in which it's called. If the compiler takes the hint (this function is a bit largish, so the compiler might not) it will have much the same effect as if you'd stuffed the loop into the function.

C++: Why does this speed my code up?

Answers (2)

Related Questions