Reputation: 2960
I'm an R developer who uses C for algorithmic purposes, and I have a question about why a C loop that looks like it should be slower is actually faster than the alternative approach.
In R, our Boolean type can actually have three values, `true`, `false`, and `na`, and we represent this using an `int` at the C level.
I'm looking into a vectorized `&&` operation (yes, we already have this in R, but bear with me) that also handles the `na` case. The scalar results would look like this:
F && F == F
F && T == F
F && N == F
T && F == F
T && T == T
T && N == N
N && F == F
N && T == N
N && N == N
Note that it works like `&&` in C, except that `na` values propagate when combined with anything except `false`, in which case we "know" that `&&` can never be true, so we return `false`.
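For concreteness, here is a scalar sketch of that table (a hypothetical helper, not part of my real code, using the same `false = 0`, `true = 1`, `na = INT_MIN` int coding as the loops below):

#include <limits.h>

// Scalar three-valued AND: false = 0, true = 1, na = INT_MIN.
static int and3(int a, int b) {
    if (a == 0 || b == 0) return 0;                    // false dominates, even over na
    if (a == INT_MIN || b == INT_MIN) return INT_MIN;  // otherwise na propagates
    return 1;                                          // both true
}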
Now to the implementation. Assume we have two vectors, `v_out` and `v_x`, and we'd like to perform the vectorized `&&` on them. We are allowed to overwrite `v_out` with the result. One option is:
// Option 1
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
        // Done
    } else if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}
And another option is:
// Option 2
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];

    if (elt_out == 0) {
        continue;
    }

    int elt_x = v_x[i];

    if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}
I sort of expected the second option to be faster, because it avoids accessing `v_x[i]` when it isn't required. But in fact it was twice as slow when compiled with `-O2`!
With the script below, I get the following timing results. Note that I am on a Mac and compile with Clang. At `-O0` the results seem reasonable: the two options are about the same. But at `-O2`, Option 1 is twice as fast!
Option 1, `clang -O0`: 0.110560
Option 2, `clang -O0`: 0.107710
Option 1, `clang -O2`: 0.032223
Option 2, `clang -O2`: 0.070557
What is going on here? My best guess is that it has something to do with the fact that in Option 1, `v_x[i]` is always accessed linearly, which is extremely fast. But in Option 2, `v_x[i]` is essentially accessed randomly (sort of), because it might access `v_x[10]` but then not need another element from `v_x` until `v_x[120]`, and because that access isn't linear, it is probably much slower.
Reproducible script:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>
int main() {
    srand(123);

    int size = 1e7;
    int na = INT_MIN;

    int* v_out = (int*) malloc(size * sizeof(int));
    int* v_x = (int*) malloc(size * sizeof(int));

    // Generate random numbers between 1-3
    // 1 -> false
    // 2 -> true
    // 3 -> na
    for (int i = 0; i < size; ++i) {
        int elt_out = rand() % 3 + 1;

        if (elt_out == 1) {
            v_out[i] = 0;
        } else if (elt_out == 2) {
            v_out[i] = 1;
        } else {
            v_out[i] = na;
        }

        int elt_x = rand() % 3 + 1;

        if (elt_x == 1) {
            v_x[i] = 0;
        } else if (elt_x == 2) {
            v_x[i] = 1;
        } else {
            v_x[i] = na;
        }
    }

    clock_t start = clock();

    // Option 1
    for (int i = 0; i < size; ++i) {
        int elt_out = v_out[i];
        int elt_x = v_x[i];

        if (elt_out == 0) {
            // Done
        } else if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    // // Option 2
    // for (int i = 0; i < size; ++i) {
    //     int elt_out = v_out[i];
    //
    //     if (elt_out == 0) {
    //         continue;
    //     }
    //
    //     int elt_x = v_x[i];
    //
    //     if (elt_x == 0) {
    //         v_out[i] = 0;
    //     } else if (elt_out == na) {
    //         // Done
    //     } else if (elt_x == na) {
    //         v_out[i] = na;
    //     }
    // }

    clock_t end = clock();
    double time = (double) (end - start) / CLOCKS_PER_SEC;

    free(v_out);
    free(v_x);

    printf("%f\n", time);

    return 0;
}
Based on a few questions in the comments, here are a few points of clarifications for future readers:
I am on a 2018 15-inch MacBook Pro with a 2.9 GHz 6-core Intel i9-8950HK (Coffee Lake).
My particular Clang version that I tested with is Apple clang version 13.1.6 (clang-1316.0.21.2.5) with Target: x86_64-apple-darwin21.6.0.
I am restricted by R to use `int` as the data type (even though there are more efficient options) and the following coding: `false = 0`, `true = 1`, `na = INT_MIN`. The reproducible example that I provided respects this.
The original question was not actually a request to make the code run faster. I just wanted to get an idea of what the difference was between my two if/else approaches. That said, some answers have shown that branchless approaches can be much faster, and I really appreciate the explanations that those users have provided! That has greatly influenced the final version of the implementation I am working on.
Upvotes: 62
Views: 7058
Reputation: 15164
Let’s take a look at what these code samples compile to, on Clang 15.0.0 with `-std=c17 -O3 -march=x86-64-v3`. Other compilers will generate slightly different code; it’s finicky.
Factoring out your code snippets into functions, we get
#include <limits.h>
#include <stddef.h>
#define na INT_MIN
int* filter1( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for ( size_t i = 0; i < size; ++i) {
        int elt_out = v_out[i];
        int elt_x = v_x[i];

        if (elt_out == 0) {
            // Done
        } else if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    return v_out;
}
int* filter2( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for (int i = 0; i < size; ++i) {
        int elt_out = v_out[i];

        if (elt_out == 0) {
            continue;
        }

        int elt_x = v_x[i];

        if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    return v_out;
}
Your option 1, `filter1` here, compiles to a vectorized loop on Clang 15. (GCC 12 has trouble with it.) The loop body compiles to:
.LBB0_8: # =>This Inner Loop Header: Depth=1
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi - 32]
vmovdqu ymm4, ymmword ptr [rdx + 4*rsi]
vpcmpeqd ymm5, ymm3, ymm0
vpcmpeqd ymm6, ymm4, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm4, ymm4, ymm2
vpand ymm3, ymm3, ymm4
vpandn ymm4, ymm5, ymm6
vpandn ymm5, ymm5, ymm7
vpand ymm3, ymm5, ymm3
vpand ymm5, ymm3, ymm2
vpor ymm3, ymm3, ymm4
vpmaskmovd ymmword ptr [r10 + 4*rsi - 32], ymm3, ymm5
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi]
vmovdqu ymm4, ymmword ptr [rdx + 4*rsi + 32]
vpcmpeqd ymm5, ymm3, ymm0
vpcmpeqd ymm6, ymm4, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm4, ymm4, ymm2
vpand ymm3, ymm3, ymm4
vpandn ymm4, ymm5, ymm6
vpandn ymm5, ymm5, ymm7
vpand ymm3, ymm5, ymm3
vpand ymm5, ymm3, ymm2
vpor ymm3, ymm3, ymm4
vpmaskmovd ymmword ptr [r10 + 4*rsi], ymm3, ymm5
add rsi, 16
add r9, -2
jne .LBB0_8
So we see the compiler optimized the loop to a series of SIMD comparisons (`vpcmpeqd` instructions) to generate a bitmask that it will then use to do conditional moves with `vpmaskmovd`. This looks more complex than it really is, because it’s partly unrolled to do two consecutive updates per iteration.
You will note that there are no branches, other than the test at the bottom of the loop for whether we are at the end of the array. However, because of conditional moves, we can sometimes get a cache miss on a load or a store. That is what I think sometimes happened in my tests.
Now let’s look at option 2:
.LBB1_8: # =>This Inner Loop Header: Depth=1
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi - 32]
vpcmpeqd ymm4, ymm3, ymm0
vpxor ymm5, ymm4, ymm1
vpmaskmovd ymm5, ymm5, ymmword ptr [r11 + 4*rsi - 32]
vpcmpeqd ymm6, ymm5, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm5, ymm5, ymm2
vpand ymm3, ymm3, ymm5
vpandn ymm5, ymm4, ymm6
vpandn ymm4, ymm4, ymm7
vpand ymm3, ymm4, ymm3
vpand ymm4, ymm3, ymm2
vpor ymm3, ymm3, ymm5
vpmaskmovd ymmword ptr [r10 + 4*rsi - 32], ymm3, ymm4
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi]
vpcmpeqd ymm4, ymm3, ymm0
vpxor ymm5, ymm4, ymm1
vpmaskmovd ymm5, ymm5, ymmword ptr [r11 + 4*rsi]
vpcmpeqd ymm6, ymm5, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm5, ymm5, ymm2
vpand ymm3, ymm3, ymm5
vpandn ymm5, ymm4, ymm6
vpandn ymm4, ymm4, ymm7
vpand ymm3, ymm4, ymm3
vpand ymm4, ymm3, ymm2
vpor ymm3, ymm3, ymm5
vpmaskmovd ymmword ptr [r10 + 4*rsi], ymm3, ymm4
add rsi, 16
add r9, -2
jne .LBB1_8
Similar code on this compiler, but slightly longer. One difference is a conditional move from the `v_x` vector.
However, that is with `-march=x86-64-v3`. If you don’t tell it it’s allowed to use AVX2 instructions, such as `vpmaskmovd`, Clang 15.0.0 will give up on vectorizing this version of the algorithm at all.
For comparison, we could refactor this code, taking advantage of the fact that the updated value of `v_out[i]` will always be equal to either `v_out[i]` or `v_x[i]`:
int* filter3( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for ( size_t i = 0; i < size; ++i) {
        const int elt_out = v_out[i];
        const int elt_x = v_x[i];

        v_out[i] = (elt_out == 0)  ? elt_out :
                   (elt_x == 0)    ? elt_x :
                   (elt_out == na) ? elt_out :
                   (elt_x == na)   ? elt_x :
                                     elt_out;
    }

    return v_out;
}
And this gets us some very different code:
.LBB2_7: # =>This Inner Loop Header: Depth=1
vmovdqu ymm6, ymmword ptr [rax + 4*rsi]
vmovdqu ymm4, ymmword ptr [rax + 4*rsi + 32]
vmovdqu ymm3, ymmword ptr [rax + 4*rsi + 64]
vmovdqu ymm2, ymmword ptr [rax + 4*rsi + 96]
vmovdqu ymm7, ymmword ptr [rdx + 4*rsi]
vmovdqu ymm8, ymmword ptr [rdx + 4*rsi + 32]
vmovdqu ymm9, ymmword ptr [rdx + 4*rsi + 64]
vmovdqu ymm5, ymmword ptr [rdx + 4*rsi + 96]
vpcmpeqd ymm10, ymm6, ymm0
vpcmpeqd ymm11, ymm4, ymm0
vpcmpeqd ymm12, ymm3, ymm0
vpcmpeqd ymm13, ymm2, ymm0
vpcmpeqd ymm14, ymm7, ymm0
vpor ymm10, ymm10, ymm14
vpcmpeqd ymm14, ymm8, ymm0
vpor ymm11, ymm11, ymm14
vpcmpeqd ymm14, ymm9, ymm0
vpor ymm12, ymm12, ymm14
vpcmpeqd ymm14, ymm5, ymm0
vpcmpeqd ymm7, ymm7, ymm1
vblendvps ymm7, ymm6, ymm1, ymm7
vpor ymm13, ymm13, ymm14
vpcmpeqd ymm6, ymm6, ymm1
vpandn ymm6, ymm10, ymm6
vpandn ymm7, ymm10, ymm7
vpcmpeqd ymm8, ymm8, ymm1
vblendvps ymm8, ymm4, ymm1, ymm8
vpcmpeqd ymm4, ymm4, ymm1
vpcmpeqd ymm9, ymm9, ymm1
vblendvps ymm9, ymm3, ymm1, ymm9
vpandn ymm4, ymm11, ymm4
vpandn ymm8, ymm11, ymm8
vpcmpeqd ymm3, ymm3, ymm1
vpandn ymm3, ymm12, ymm3
vpandn ymm9, ymm12, ymm9
vpcmpeqd ymm5, ymm5, ymm1
vblendvps ymm5, ymm2, ymm1, ymm5
vpcmpeqd ymm2, ymm2, ymm1
vpandn ymm2, ymm13, ymm2
vpandn ymm5, ymm13, ymm5
vblendvps ymm6, ymm7, ymm1, ymm6
vblendvps ymm4, ymm8, ymm1, ymm4
vblendvps ymm3, ymm9, ymm1, ymm3
vblendvps ymm2, ymm5, ymm1, ymm2
vmovups ymmword ptr [rax + 4*rsi], ymm6
vmovups ymmword ptr [rax + 4*rsi + 32], ymm4
vmovups ymmword ptr [rax + 4*rsi + 64], ymm3
vmovups ymmword ptr [rax + 4*rsi + 96], ymm2
add rsi, 32
cmp r11, rsi
jne .LBB2_7
Although this looks longer, it is updating four vectors on each iteration, and it is in fact blending the `v_out` and `v_x` vectors with a bitmask. The GCC 12.2 version of this loop follows similar logic with one update per iteration, so it is more concise:
.L172:
vmovdqu ymm3, YMMWORD PTR [rcx+rax]
vpcmpeqd ymm0, ymm2, YMMWORD PTR [rsi+rax]
vpcmpeqd ymm1, ymm3, ymm2
vpcmpeqd ymm6, ymm3, ymm4
vpcmpeqd ymm0, ymm0, ymm2
vpcmpeqd ymm1, ymm1, ymm2
vpand ymm0, ymm0, ymm1
vpcmpeqd ymm1, ymm4, YMMWORD PTR [rsi+rax]
vpor ymm1, ymm1, ymm6
vpand ymm6, ymm0, ymm1
vpandn ymm1, ymm1, ymm0
vpxor ymm0, ymm0, ymm5
vpblendvb ymm0, ymm3, ymm2, ymm0
vpblendvb ymm0, ymm0, ymm3, ymm1
vpblendvb ymm0, ymm0, ymm4, ymm6
vmovdqu YMMWORD PTR [rcx+rax], ymm0
add rax, 32
cmp rdx, rax
jne .L172
This, as you see, is roughly as tight as a rolled-up version of 1 and 3 that did one update per iteration, but some optimizers seem to have less trouble with it. A similar version, whose code differs mainly in register allocations, would be:
int* filter4( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for ( size_t i = 0; i < size; ++i) {
        const int elt_out = v_out[i];
        const int elt_x = v_x[i];

        v_out[i] = (elt_out == 0)  ? 0 :
                   (elt_x == 0)    ? 0 :
                   (elt_out == na) ? na :
                   (elt_x == na)   ? na :
                                     elt_out;
    }

    return v_out;
}
What seems to have happened is that your compiler was able to vectorize your version 1, but not your version 2, on the settings you were using. If it can vectorize both, they perform similarly.
In 2022, a compiler with aggressive optimization settings can turn any of these loops into vectorized branchless code, at least if you enable AVX2. If you do, the second version is capable, as you thought, of loading from `v_x` conditionally. (This does lead to a big observable difference when you initialize `v_out` to all-zeroes.) Compilers in 2022 seem to do better with the single assignment statements of versions 3 and 4 than the `if` blocks of 1 and 2: they vectorize on some targets and settings on which 1 and 2 do not, and even when all four do, Clang 15.0.0 unrolls 3 and 4 more aggressively than 1 and 2.
With AVX-512 instructions enabled, the compiler can optimize all four versions to similar branchless code, and there isn't any significant difference in performance. With other targets (in particular, `-O3 -march=x86-64-v2` and `-O3 -march=x86-64-v3`), Clang 15.0.0 does significantly better with versions 3 and 4 than with 1 and 2.
However, if you are willing to change the behavior of the function for some inputs, you can remove the comparisons and conditional moves for further speedup, as in Peter Cordes’ and Karl Knechtel's answers. Here, I wanted to compare like to like.
In my testing, which version was faster depended heavily on what the input values were initialized to. With the same random seed you used, `filter1` was slightly faster than the other three, but with truly randomized data, any of the four could be faster.
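If anyone wants to reproduce this, a harness along these lines (a hypothetical helper, not my exact test setup) can time the variants with `clock()` the same way the question's script does; each variant should be given its own freshly filled copy of `v_out`, since the functions overwrite it:

#include <stddef.h>
#include <time.h>

// Time one filter variant; f can be any of filter1..filter4 above.
static double time_filter( int* (*f)(size_t, int*, const int*),
                           size_t size, int* v_out, const int* v_x )
{
    clock_t start = clock();
    f(size, v_out, v_x);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}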
Upvotes: 12
Reputation: 365812
If you want fast vectorized code, don't do short-circuit evaluation, and don't branch in general. You want the compiler to be able to do 16 or 32 elements at once with SIMD operations, using 8-bit elements. (Compilers can optimize `if`s to branchless code if it's safe to do the work unconditionally, including dereferences, and there aren't side effects. This is called if-conversion, and is typically necessary for auto-vectorization of code like this.)
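As a sketch of what if-conversion looks like (hypothetical functions, not the question's code): both loops below compute the same thing, but the second reads `b[i]` and writes `out[i]` unconditionally, which is the shape the vectorizer needs. The compiler may only perform this rewrite itself if it can prove the extra accesses are safe and have no side effects.

// Branchy form: b[i] is only read, and out[i] only written, when out[i] != 0.
void branchy(int *out, const int *b, int n) {
    for (int i = 0; i < n; i++)
        if (out[i] != 0)
            out[i] = b[i];
}

// If-converted form: every b[i] is read and every out[i] is rewritten,
// selecting between the old and new value without a branch.
void branchless(int *out, const int *b, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (out[i] != 0) ? b[i] : out[i];
}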
And you don't want the compiler to worry that it's not allowed to touch some memory because the C abstract machine doesn't. E.g., if all the `v_out[i]` elements are false, `v_x` might be a NULL pointer without causing UB! So the compiler can't invent read access to objects that the C logic doesn't read at all.
If `v_x` were truly an array, not just a pointer, the compiler would know that it's readable and would be allowed to invent accesses to it by doing if-conversion of the short-circuit logic to branchless. But if its cost heuristics don't see a really big benefit (like auto-vectorization), it might choose not to. Branchy code will often in practice be slower with a random mix of trues and falses (and NAs).
As you can see in the compiler's assembly output (Clang 15 `-O2` on Compiler Explorer), option 1 auto-vectorizes with SIMD, branchlessly handling 4 optional-bools in parallel (with just SSE2, more with `-march=native`). (Thanks to Richard in comments for cooking up the Compiler Explorer link; it's probably reflective of what Apple Clang will do to your real code in `main`.)
Your 3-state bool that supports an NA state can be implemented with 2 bits, in a way that bitwise AND does your `&&` operation. You can store arrays of it as one per `unsigned char`, or packed 4 per char to quadruple your throughput for vectorized operations, at the cost of slower access. (Or in general `CHAR_BIT/2` per `char`, but on mainstream C implementations for x86 that's 4.)
F = 00
N = 10 (in binary, so C `0b10` aka `2`)
T = 11
Convert to `bool` with `val & 1`. Convert from `bool` with `0b11 * b` or something to broadcast the low bit to both positions.
`F & anything = 0` because F is all-zero bits. `N&N == N`; that's trivially true for any bit-pattern.
The "clever" part is that `N&T = T&N = N`, since the set bits in T are a superset of those in N.
This also works for `||` with bitwise `|`: `F|N == N` and `F|T == T` because `0|x == x`. Also `x|x == x` for any same input, so we're still fine there. `N = 0b10` won't set the low bit when ORing, but will clear it when ANDing.
I forgot you said C instead of C++, so the bare-bones C++ class wrapper below (only sufficient to demo a few test callers) might not be relevant, but a loop doing `c1[i] &= c2[i];` in plain C for `unsigned char *c1, *c2` will auto-vectorize exactly the same way, as in the following sketch.
struct NBool{ // Nullable bool, should probably rename to optional bool
    unsigned char val;
    static const unsigned char F = 0b00;
    static const unsigned char T = 0b11;
    static const unsigned char N = 0b10;   // N&T = N; N&N = N; N&F = F

    auto operator &=(NBool rhs){   // define && the same way if you want, as non-short-circuiting
        val &= rhs.val;
        return *this;
    }
    operator bool() { return val & 1; }

    constexpr NBool(unsigned char x) : val(x) {};
    constexpr NBool& operator=(const NBool &) = default;
};
#include <stdint.h>
#include <stdlib.h>
bool test(NBool a){
    return a;
}

bool test2(NBool a){
    NBool b = NBool::F;
    return a &= b;   // return false
}

void foo(size_t len, NBool *a1, NBool *a2 )
{
    for (std::size_t i = 0 ; i < len ; i++){
        a1[i] &= a2[i];
    }
}
(I think "Nullable" isn't really correct terminology for something that can have be NaN / NA; it's always safe to read, and it's not a reference in the first place. Maybe optional_bool, like C++ std::optional
which is a value that may or may not be present.)
This compiles on Compiler Explorer with GCC and Clang. Clang auto-vectorizes fairly nicely with an unrolled loop doing `vandps`. (A bit of a weird choice by clang; on `-march=haswell`, `vpand` has better throughput.) But it is still limited by 1/clock store and 2/clock load anyway; this very much bottlenecks on load/store with such low computational intensity, even if the data is hot in L1d cache.
(Intel's optimization manual says that even though Skylake's peak L1d bandwidth is 2 loads + 1 store per clock (96 bytes with 32-byte vectors), the sustained bandwidth is more like 84 bytes per clock.)
It can still come relatively close to 32 bytes ANDed per clock cycle, with AVX. So that's 32 NBool `&` operations, or 128 per clock if you pack 4 NBools per byte.
Packing NBools down to a packed bitmap of 1-bit bools could be done with `pslld xmm, 7` / `pmovmskb` to extract the low bit of each byte (after shifting it to the high bit).
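A rough sketch of that idea with SSE2 intrinsics (hypothetical helper; assumes one NBool per byte with the truth value in the low bit):

#include <emmintrin.h>
#include <stdint.h>

// Pack 16 byte-sized NBools into a 16-bit bool bitmap.
static uint16_t pack16_nbool(const unsigned char *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    v = _mm_slli_epi32(v, 7);                // move each byte's low bit into its high bit
    return (uint16_t)_mm_movemask_epi8(v);   // collect the high bit of every byte
}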
If stored 4 per byte, some SIMD bit-manipulation is in order to pack to bools, perhaps `vpshufb` as a 4-bit LUT to pack pairs of NBools down to a pair of bools in the bottom of a nibble, then combine? Or use scalar BMI2 `pext` to extract every other bit from 64 bits, if you're on Zen 3 or Haswell or later, for fast `pext`.
Upvotes: 70
Reputation: 61643
Note that it works like && in C except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
Rather than represent the values as a strict enumeration, allow a numeric value of either `2` or `3` to represent `na` (you can check this when displaying, or have a normalization step after all the number-crunching; see the sketch after the loop below). This way, no conditional logic (and thus no costly branch prediction) is needed: we simply logical-or the bit in the 2s place (regardless of the operator), and logical-and (or whichever operator) the bit in the 1s place.
int is_na(int value) {
    return value & 2;
}

void r_and_into(unsigned* v_out, unsigned* v_x, int size) {
    for (int i = 0; i < size; ++i) {
        unsigned elt_out = v_out[i];
        unsigned elt_x = v_x[i];
        // this can probably be micro-optimized somehow....
        v_out[i] = (elt_out & elt_x & 1) | ((elt_out | elt_x) & 2);
    }
}
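A sketch of the normalization step mentioned above (hypothetical function; it just collapses the two possible na codes to a single canonical value):

void r_normalize(unsigned* v, int size) {
    for (int i = 0; i < size; ++i) {
        if (v[i] & 2) {
            v[i] = 2;  // 2 and 3 both mean na; pick 2 as the canonical code
        }
    }
}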
If we are forced to use `INT_MIN` to represent the N/A value, we can start by observing what that looks like in two's complement: it has exactly one bit set (the sign bit, which would be most-significant in unsigned values). Thus, we can use that bit value instead of `2` with the same kind of unconditional logic, and then correct any (`INT_MIN | 1`) results to `INT_MIN`:
#include <limits.h>

const unsigned MSB_FLAG = (unsigned)INT_MIN;

void r_and_into(int* v_out, int* v_x, int size) {
    for (int i = 0; i < size; ++i) {
        unsigned elt_out = (unsigned)v_out[i];
        unsigned elt_x = (unsigned)v_x[i];
        elt_out = (elt_out & elt_x & 1) | ((elt_out | elt_x) & MSB_FLAG);
        // if the high bit is set, clear the low bit,
        // i.e., AND the low bit with the negation of the high bit.
        v_out[i] = (int)(elt_out & ~(elt_out >> 31));
    }
}
(All these casts might not be necessary, but I think it is good practice to use unsigned types for bitwise manipulations. They should all get optimized away anyway.)
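A quick smoke test of the `INT_MIN` variant (assuming the second `r_and_into` above is in the same file):

#include <limits.h>
#include <stdio.h>

int main(void) {
    int a[3] = { 1,       INT_MIN, 0 };
    int b[3] = { INT_MIN, INT_MIN, 1 };
    r_and_into(a, b, 3);
    // Expected: INT_MIN, INT_MIN, 0  (T && N == N, N && N == N, F && T == F)
    for (int i = 0; i < 3; ++i) {
        printf("%d\n", a[i]);
    }
    return 0;
}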
Upvotes: 18
Reputation: 181714
Why is this seemingly slower C loop actually twice as fast as the other way?
At a high level, it is a quirk of the compiler and execution environment you are using. Unless array `v_x` is declared `volatile`, the compiler is free to interpret the two variations on your code exactly the same way.
I sort of expected the second option to be faster because it avoids accessing `v_x[i]` when it isn't required.
And if the compiler's optimizer judged that to be true, then it could make use of that judgement to conditionally avoid reading `v_x[i]` in conjunction with the first code.
But at a lower level, if the compiler generates code that indeed does conditionally avoid reading `v_x[i]` in option 2 but not in option 1, then you are probably observing the effects of branch misprediction in the option 2 case. It is entirely plausible that it is cheaper on average to read `v_x[i]` unconditionally than to suffer large numbers of branch misprediction penalties involving whether it should be read.
One of the takeaways is that on modern hardware, branches can be a lot more expensive than one might expect, especially when they are difficult for the CPU to predict. Where the same computation can be performed via a branchless approach, that may yield a performance win in practice, usually at a cost in source code clarity. @KarlKnechtel's answer presents a possible branchless variation on the computation you are trying to perform (branchless but for the `for` loop condition, which is pretty predictable).
Upvotes: 23