Elad Weiss
Elad Weiss

Reputation: 4002

A faster way to get a value based on condition then ternary operator?

Here is what I'm trying to acheive. It's simple enough:

unsigned int foo1(bool cond, unsigned int num)
{
    return cond ? num : 0;
}

Assmebly:

    test    dil, dil
    mov     eax, 0
    cmovne  eax, esi
    ret

My question is, is there a faster way to do it? Here are some ways I thought of:

Using multiplication:

unsigned int foo2(bool cond, unsigned int num)
{
    return cond * num;
}

Assmbly:

    movzx   eax, dil
    imul    eax, esi
    ret

Using memory access:

unsigned int foo3(bool cond, unsigned int num)
{
    static const unsigned int masks[2] = { 0x0, 0xFFFFFFFF };
    return masks[cond] & num;
}

Assembly:

    movzx   edi, dil
    mov     eax, DWORD PTR foo3(bool, unsigned int)::masks[0+rdi*4]
    and     eax, esi
    ret

Using some bit tricks:

unsigned int foo4(bool cond, unsigned int num) 
{
    return (0 - (unsigned)cond) & num;
}

Assembly:

    movzx   eax, dil
    neg     eax
    and     eax, esi
    ret

Now, multiplication yields the least instructions, I think it's the best choice, but I'm not sure about the imul. Any suggestions?

Thanks in advance,

Upvotes: 7

Views: 573

Answers (3)

Elad Weiss
Elad Weiss

Reputation: 4002

Afer viewing all wisening answers and comments,

I believe this is the correct answer:

When getting to such levels of micro-optimizatin, there is no one 'best' choice, as it may vary depending on platform, OS and the written software.

So, it seems to me the correct approach software-wise would be to create more than one implementation, and encapsulate them with some abstraction, so they can be easily switched.

When benchmarking, switch between them to see which one yields best results for the SITUATION.

Of course we can rule out solutions which are obviously worse than others.

Upvotes: 0

Lundin
Lundin

Reputation: 213832

Optimizing code isn't always as easy as counting assembler instructions and CPU ticks.

The multiplication method is likely the fastest on most systems, since it removes a branch. The multiplication instruction should be reasonably fast on most CPU cores.

What you could consider though, is if you really need to use such large integer types. On small 8 or 16 bit CPUs, the following code would be significantly faster:

uint_fast16_t foo2(bool cond, uint_fast16_t num)
{
    return (uint_fast16_t)cond * num;
}

On the other hand, such CPUs rarely come with branch prediction or instruction cache.

You shouldn't need to worry about manual function inlining. The compiler will inline this function automatically on most compilers.

Upvotes: 1

K.Hacene
K.Hacene

Reputation: 135

Multiplications and memory accesses take frequently more time than a simple if statement. If you want to optimize this code, the best way would be to use only "and" or "or" instructions (set it as inline to avoid a function call by the way).

Here is an 'optimized' example of your function using masks instead of booleans :

inline unsigned int foo1(unsigned int mask, unsigned int num)
{
  return mask & num;
}

Your call would look like this :

foo1(0, 10);     /* Returns 0  */
foo1(~0, 10);    /* Returns 10 */

Upvotes: 1

Related Questions