QuantumKarl

Reputation: 499

Could this alternative way to loop be more efficient?

I was bored one rainy afternoon and came up with this:

int ia_array[5][5][5]; // integer array called ia_array

{
        int i = 0, j = 0, k = 0;//counters
        while( i < 5 )//loop conditions
        {
            ia_array[i][j][k] = 0;//do something
            __asm inc k;//++k;

            if( k > 4)
            {
                __asm inc j;  //++j;
                __asm mov k,0;// k = 0;
            }
            if( j > 4)
            {
                __asm inc i;  //++i;
                __asm mov j,0;//j = 0;
            }
        }//end of while
    }//i,j,k fall out of scope

It's functionally equivalent to three nested for loops. However, in a for loop you cannot use __asm statements. You also have the option of not putting the counters in a scope, so you can reuse them for other loops. I have looked at the disassembly for both: my alternative has 15 opcodes and the nested for loops have 24. Is it therefore potentially faster? I suppose what I'm really asking is: is __asm inc i; faster than ++i;?

note: I don't intend to use this code in any projects; it's just out of curiosity. Thanks for your time.

Upvotes: 3

Views: 1531

Answers (5)

Jerry Coffin

Reputation: 490098

While it certainly is possible to beat a compiler at optimization, you're not going to do it this way. The bits you've written in assembly language are pretty obvious, mechanical types of translations that any half-way decent compiler (or even a pretty lousy one) can do easily.

If you want to beat the compiler, you need to go a lot further, such as rearranging instructions to allow more to execute in parallel (decidedly non-trivial) or finding a better sequence of instructions than the compiler can.

In this case, for example, you might at least stand a chance by noting that ia_array[5][5][5] can (from an assembly language viewpoint) be treated as a single, flat array of 5*5*5 = 125 elements, and encode most of what's essentially a memset into a single instruction:

mov ecx, 125    // 125 elements
xor eax, eax    // set them to zero
mov edi, offset ia_array // where we're going to store them
rep stosd       // and fill that memory.

Realistically, however, this probably isn't going to be a major (or probably even minor) improvement over what the compiler is likely to generate. It's more likely close to the minimum necessary to (at least nearly) keep up.

The next step would be to consider using non-temporal stores instead of a simple stosd. This won't actually speed up this loop (much, anyway), but it might gain some speed overall by avoiding this store polluting the cache if it's possible that other code already in the cache is more important immediately. You could also use some of the other SSE instructions to gain a little speed -- but even at best, you can't expect much better than a couple of percent out of this. The bottom line is that for zeroing some memory, the speed is limited primarily by the bus speed, not the instructions you use, so nothing you do is likely to help much.

Upvotes: 1

James

Reputation: 9278

First off, your compiler will likely store the values of i, j and k in registers.

It's more efficient to do for (i = 4; i >= 0; i--) than for (i = 0; i < 5; i++), as the CPU can determine for free whether the last operation it executed produced zero: the dec instruction already sets the flags that the loop branch tests, so there's no need for an explicit compare against the limit.

It's not the case on x86 that executing fewer instructions leads to faster code. There are all sorts of instruction-pipelining issues that quickly become too much for a programmer to manage by hand. Leave it to the compiler; they're sufficiently efficient these days (though definitely not optimal... but who wants to wait hours for their code to compile).

You can check it out yourself by running your function a few hundred thousand times with each implementation and measuring which is faster. Check whether you can write asm instructions in for loops with

__asm {
    inc j;
    mov k, 0;
}

(it's been a while since I did this)

P.S. Have fun experimenting with asm, it can be very interesting and rewarding!

Upvotes: 2

Todd Gamblin

Reputation: 59827

Several things:

  1. You can't judge the speed of assembly code based on the number of opcodes in the output. Compilers can unroll loops to eliminate branches, and many modern compilers will attempt to vectorize a loop like the one above. The former could have more opcodes than naive code and be faster, and the latter could have fewer and be faster.

  2. By putting __asm statements in your code, you're probably precluding any optimizations the compiler could do on the loop. So if you compiled this with something really fast like, say, the Intel compilers, then you will likely get worse performance with your code than with the compiler. This is especially true for something as simple as your code here, where the array sizes are known statically and the loop bounds are constant.

If you really want to get a sense of what compilers can/can't do, grab a book or take a course on optimizing compilers and vectorization. There are tons of different optimizations and understanding the performance of even a simple piece of code like this on a particular architecture can be subtle.

There are plenty of kernels and number-crunching codes where compilers still can't do better than knowledgeable humans, but without a lot of experience with architecture details you're not going to do much better than icc -fast or xlC -O5.

Upvotes: 1

This is going to be very compiler- and compiler-switch-specific, but your code performs three tests per loop iteration where a traditional nested loop would only have one per inner-most loop iteration, so I think your approach would tend to be slower in general.

Upvotes: 1

Puppy

Reputation: 146910

No, it won't be even remotely faster. In fact, it could quite easily be slower. Your compiler's optimizer is almost certainly more effective at this than you are.

Upvotes: 1
