Reputation: 4002
Here is what I'm trying to acheive. It's simple enough:
unsigned int foo1(bool cond, unsigned int num)
{
return cond ? num : 0;
}
Assmebly:
test dil, dil
mov eax, 0
cmovne eax, esi
ret
My question is, is there a faster way to do it? Here are some ways I thought of:
unsigned int foo2(bool cond, unsigned int num)
{
return cond * num;
}
Assmbly:
movzx eax, dil
imul eax, esi
ret
unsigned int foo3(bool cond, unsigned int num)
{
static const unsigned int masks[2] = { 0x0, 0xFFFFFFFF };
return masks[cond] & num;
}
Assembly:
movzx edi, dil
mov eax, DWORD PTR foo3(bool, unsigned int)::masks[0+rdi*4]
and eax, esi
ret
unsigned int foo4(bool cond, unsigned int num)
{
return (0 - (unsigned)cond) & num;
}
Assembly:
movzx eax, dil
neg eax
and eax, esi
ret
Now, multiplication yields the least instructions, I think it's the best choice, but I'm not sure about the imul. Any suggestions?
Thanks in advance,
Upvotes: 7
Views: 573
Reputation: 4002
Afer viewing all wisening answers and comments,
I believe this is the correct answer:
When getting to such levels of micro-optimizatin, there is no one 'best' choice, as it may vary depending on platform, OS and the written software.
So, it seems to me the correct approach software-wise would be to create more than one implementation, and encapsulate them with some abstraction, so they can be easily switched.
When benchmarking, switch between them to see which one yields best results for the SITUATION.
Of course we can rule out solutions which are obviously worse than others.
Upvotes: 0
Reputation: 213832
Optimizing code isn't always as easy as counting assembler instructions and CPU ticks.
The multiplication method is likely the fastest on most systems, since it removes a branch. The multiplication instruction should be reasonably fast on most CPU cores.
What you could consider though, is if you really need to use such large integer types. On small 8 or 16 bit CPUs, the following code would be significantly faster:
uint_fast16_t foo2(bool cond, uint_fast16_t num)
{
return (uint_fast16_t)cond * num;
}
On the other hand, such CPUs rarely come with branch prediction or instruction cache.
You shouldn't need to worry about manual function inlining. The compiler will inline this function automatically on most compilers.
Upvotes: 1
Reputation: 135
Multiplications and memory accesses take frequently more time than a simple if statement. If you want to optimize this code, the best way would be to use only "and" or "or" instructions (set it as inline to avoid a function call by the way).
Here is an 'optimized' example of your function using masks instead of booleans :
inline unsigned int foo1(unsigned int mask, unsigned int num)
{
return mask & num;
}
Your call would look like this :
foo1(0, 10); /* Returns 0 */
foo1(~0, 10); /* Returns 10 */
Upvotes: 1