Reputation:
I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"
int a, b, low;
if (a < b) low = a;
else low = b;
is more or less efficient than this:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
(or the variation of putting int delta = a - b
at the top and rerplacing instances of a - b
with that).
I'm just wondering which one of these would be more efficient (or if the difference is too miniscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
Upvotes: 25
Views: 13980
Reputation: 7802
Updated answer taking the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.
TLDR summary: avoiding if
s can help with vectorization.
Because SIMD would be too complex to allow branching on some elements, but not others, any code containing an if
statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that are doing this as an integrated part of the vectorization pass (Clang does some of this independently, but not specificly toward helping vectorization AFAIK)
Using the OP's provided example:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
Many compilers can vectorize this to be something approximately equivalent to:
__m128i low128i(__m128i a, __m128i b){
__m128i diff, tmp;
diff = _mm_sub_epi32(a,b);
tmp = _mm_srai_epi32(diff, 31);
tmp = _mm_and_si128(tmp,diff);
return _mm_add_epi32(tmp,b);
}
This optimization would require the data to be layed out in a fashion that would allow for it, but it could be extended to __m256i with avx2 or __m512i with avx512 (and even unroll loops further to take advantage of additional registers) or other simd instructions on other architectures. Another plus is that these instructions are all low latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5 - so really fast relative to non-vectorized code)
I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput and other architectures like arm may lack it entirely) but it could be done by doing something like:
void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
__m128i _a=*a, _b=*b;
__m128i lomask = _mm_cmpgt_epi32(_a,_b),
__m128i himask = _mm_cmpgt_epi32(_b,_a);
_mm_maskmoveu_si128(_b,lomask,a);
_mm_maskmoveu_si128(_a,himask,b);
}
However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.
Upvotes: 0
Reputation: 3232
I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.
Consider inline int bct_decoder( unsigned bctData )
function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them
First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
unsigned lsB = bctData & 1;
unsigned msB = bctData >> 1;
return
( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
( lsB > msB ) ? 1 : -1;
}
This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)
Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:
inline int bct_decoder_math( unsigned bctData ) {
return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.
If this is not enough then we can go futher, having something special. Next is called lookup table approach:
inline int bct_decoder_lut( unsigned bctData ) {
static const int decoderLUT[] = { 0, 1, -1, 0 };
return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND
instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.
So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.
Upvotes: 0
Reputation: 48591
One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure
int foo (int a, int b) {
return ((a < b) ? a : b);
}
is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes
int bar = foo (x, x+3);
After inlining, the compiler will recognize that 3
is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get
int bar = x;
It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.
Upvotes: 2
Reputation: 86006
The biggest problem is that your second example won't work on 64-bit machines.
However, even neglecting that, modern compilers are smart enough to consider branchless prediction in every case possible, and compare the estimated speeds. So, you second example will most likely actually be slower
There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.
[Edit] Because I think this is such an interesting topic, I've written a blog post on it.
Upvotes: 7
Reputation: 17420
I'm just wondering which one of these would be more efficient (or if the difference is to miniscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
Desktop/server CPUs are optimized for pipelining. Second is theoretically faster because CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of expression in parallel. More non-branching code with intermixed independent operations are best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions which allow to make the first code branch-less too.)
On embedded CPUs branching if often less expensive (relatively to everything else), nor they have many spare ALUs to evaluate operations out-of-order (that's if they support out-of-order execution at all). Less code/data is better - caches are small too. (I have even seen uses of buble-sort in embedded applications: the algorithm uses least of memory/code and fast enough for small amounts of information.)
Important: do not forget about the compiler optimizations. Using many tricks, the compilers sometimes can remove the branching themselves: inlining, constant propagation, refactoring, etc.
But in the end I would say that yes, the difference is minuscule to be relevant. In long term, readable code wins.
The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.
Upvotes: 1
Reputation: 405765
Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.
Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.
Upvotes: 10
Reputation: 44266
Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:
int foo1(int a, int b)
{
int low;
if (a < b) low = a;
else low = b;
return low;
}
int foo2(int a, int b)
{
int low;
low = b + ((a - b) & ((a - b) >> 31));
return low;
}
I get:
foo1:
cmpl %edi, %esi
cmovle %esi, %edi
movl %edi, %eax
ret
foo2:
subl %esi, %edi
movl %edi, %eax
sarl $31, %eax
andl %edi, %eax
addl %esi, %eax
ret
...which I'm pretty sure won't count for branch predictions, since the code doesn't jump. Also, the non if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do it's job.
Upvotes: 10
Reputation: 1546
If it is for Gnu C++, try this
int min = i <? j;
I have not profiled it but I think it is definitely the one to beat.
Upvotes: -1
Reputation: 16379
profile results with gcc -o foo -g -p -O0, Solaris 9 v240
%Time Seconds Cumsecs #Calls msec/call Name
36.8 0.21 0.21 8424829 0.0000 foo2
28.1 0.16 0.37 1 160. main
17.5 0.10 0.4716850667 0.0000 _mcount
17.5 0.10 0.57 8424829 0.0000 foo1
0.0 0.00 0.57 4 0. atexit
0.0 0.00 0.57 1 0. _fpsetsticky
0.0 0.00 0.57 1 0. _exithandle
0.0 0.00 0.57 1 0. _profil
0.0 0.00 0.57 1000 0.000 rand
0.0 0.00 0.57 1 0. exit
code:
int
foo1 (int a, int b, int low)
{
if (a < b)
low = a;
else
low = b;
return low;
}
int
foo2 (int a, int b, int low)
{
low = (a < b) ? a : b;
return low;
}
int main()
{
int low=0;
int a=0;
int b=0;
int i=500;
while (i--)
{
for(a=rand(), b=rand(); a; a--)
{
low=foo1(a,b,low);
low=foo2(a,b,low);
}
}
return 0;
}
Based on data, in the above environment, the exact opposite of several beliefs stated here were not found to be true. Note the 'in this environment' If construct was faster than ternary ? : construct
Upvotes: 0
Reputation: 55415
For something as simple as this, why not just experiment and try it out?
Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.
I wrote a simple program that compares both techniques passing in random numbers (so that we don't see perfect branch prediction) with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iteration? Less than 50ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.
Upvotes: 4
Reputation: 47428
Like with any low-level optimization, test it on the target CPU/board setup.
On my compiler (gcc 4.5.1 on x86_64), the first example becomes
cmpl %ebx, %eax
cmovle %eax, %esi
The second example becomes
subl %eax, %ebx
movl %ebx, %edx
sarl $31, %edx
andl %ebx, %edx
leal (%rdx,%rax), %esi
Not sure if the first one is faster in all cases, but I would bet it is.
Upvotes: 6
Reputation: 4485
One thing I will point out that I haven't noticed mention that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of number scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.
Upvotes: 1
Reputation: 69342
Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.
I would profile the application an concentrate your optimization efforts to something more worthwhile.
Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.
For simple statements like this, I find the ternary operator very intuitive:
low = (a < b) ? a : b;
Clear and concise.
Upvotes: 4
Reputation: 7951
Why low = a;
in the if
and low = a;
in the else
? And, why 31
? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of different size?
The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.
Upvotes: 0
Reputation: 10663
Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.
My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.
Upvotes: -1
Reputation: 83250
(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)
One advantage of eliminating an if
statement is that you avoid branch prediction penalties.
Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like
a = random() % 10
if (a < 5)
print "Less"
else
print "Greater"
then this branch is not easily predicted, and will frequently incur the prediction penalty associated with clearing the cache and rolling back instructions that were executed in the wrong part of the branch.
One way to avoid these kinds of penalties is to use the ternary (?:
) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.
So
int a, b, low;
if (a < b) low = a;
else low = b;
becomes
int a, b, low;
low = (a < b) ? a : b
and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.
Of course, this is a micro-optimization which is unlikely to have significant impact on your code.
Upvotes: 34