Reputation: 495
I have a function like this in C (in pseudo-ish code, dropping the unimportant parts):
int func(int s, int x, int *a, int *r) {
    int i;
    // do some stuff
    for (i = 0; i < a_really_big_int; ++i) {
        if (s)
            r[i] = x ^ i;
        else
            r[i] = x ^ a[i];
        // and maybe a couple other ways of computing r
        // that are equally fast individually
    }
    // do some other stuff
}
This code gets called so often that this loop is actually a speed bottleneck in the code. I am wondering a couple of things:
Since the switch s is a constant in the function, will good compilers optimize the loop so that the branch isn't slowing things down all the time?
If not, what is a good way to optimize this code?
====
EDIT: Here is an update with a fuller example:
int func(int s,
         int start, int stop, int stride,
         double *x, double *b,
         int *a, int *flips, int *signs, int i_max,
         double *c)
{
    int i, k, st;
    for (k = start; k < stop; k += stride) {
        b[k] = 0;
        for (i = 0; i < i_max; ++i) {
            /* this is the code in question */
            if (s) st = k ^ flips[i];
            else   st = a[k] ^ flips[i];
            /* done with code in question */
            b[k] += x[st] * (__builtin_popcount(st & signs[i]) % 2 ? -c[i] : c[i]);
        }
    }
}
EDIT 2:
In case anyone is curious, I ended up refactoring the code, hoisting the whole inner for loop (over i_max) out, making the really_big_int loop much simpler and hopefully easy to vectorize! (It also avoids redoing a bunch of extra logic a zillion times.)
Upvotes: 2
Views: 3793
Reputation: 7802
Assuming x86_64, you can ensure that the pointers are aligned to 16 bytes and use intrinsics. If the code only runs on systems with AVX2, you could use the __m256i variants (and similarly for AVX-512).
int func(int s, int x, const __m128i* restrict a, __m128i* restrict r) {
    size_t i = 0, max = a_really_big_int / 4;
    __m128i xv = _mm_set1_epi32(x);
    // do some stuff
    if (s) {
        // _mm_set_epi32 takes lanes high-to-low, so this is {0,1,2,3}
        __m128i iv = _mm_set_epi32(3, 2, 1, 0);
        __m128i four = _mm_set1_epi32(4);
        for ( ; i < max; ++i, iv = _mm_add_epi32(iv, four)) {
            r[i] = _mm_xor_si128(xv, iv);
        }
    } else { /* not (s) */
        for ( ; i < max; ++i) {
            r[i] = _mm_xor_si128(xv, a[i]);
        }
    }
    // do some other stuff
}
Upvotes: 1
Reputation: 153303
Micro-optimizations
Usually they are not worth the time - reviewing larger issues is more effective.
Yet to micro-optimize, trying a variety of approaches and then profiling them to find the best can yield modest improvements.
In addition to @wallyk's and @kabanus's fine answers, some simplistic compilers benefit from a loop that counts down to 0.
// for (i = 0; i < a_really_big_int; ++i) {
for (i = a_really_big_int; i--; ) {
[edit 2nd optimization]
OP added a more complete example. One of the issues is that the compiler cannot assume that the memory pointed to by b and the other pointers does not overlap. This prevents certain optimizations.
Assuming they do not in fact overlap, use restrict on b to allow optimizations. const helps too, for weaker compilers that do not deduce it. restrict on the others may also benefit, again, if the referenced data does not overlap.
// int func(int s, int start, int stop, int stride, double *x,
// double *b, int *a, int *flips,
// int *signs, int i_max, double *c) {
int func(int s, int start, int stop, int stride, const double * restrict x,
double * restrict b, const int * restrict a, const int * restrict flips,
const int * restrict signs, int i_max, double *c) {
Upvotes: 3
Reputation: 19221
Although the if statement will be optimized away by any decent compiler (unless you ask the compiler not to optimize), I would consider writing the optimization in explicitly (just in case you compile without optimizations).
In addition, although the compiler might optimize away the sign-flipping conditional, I would consider optimizing it manually, either using any available builtin or using bitwise operations, i.e.:
b[k] += x[st] * c[i] *
        (1 - 2 * (__builtin_popcount(st & signs[i]) & 1));
This takes the last bit of popcount (1 == odd, 0 == even), maps it to -1 or 1 with 1 - 2*bit, and multiplies c[i] by it, which gives the same result as the conditional -c[i] : c[i].
This avoids instruction jumps in cases where the sign-flipping conditional isn't optimized.
P.S.
Note that the ^ operator cannot be applied to a double operand, so XORing a bit mask directly into c[i] would require type punning; the ±1 multiply keeps the trick simple and portable.
Upvotes: 0
Reputation: 25895
All your commands are quick O(1) commands in the loop. The if is definitely optimized, and so is your for+if, as long as all your commands are of the form r[i] = something_quick. The question may boil down to just how big a_really_big_int really needs to be.
A quick int main that just goes from INT_MIN to INT_MAX, summing into a long variable, takes ~10 seconds for me on the Ubuntu subsystem on Windows. Your commands may multiply this by a few, which quickly gets to a minute. Bottom line, this may not be avoidable if you really are iterating a ton.
If the r[i] are calculated independently, this would be a classic use case for threading / multi-processing.
EDIT:
I think % 2 is optimized anyway by the compiler, but if not, take care: x & 1 is a much faster odd/even check.
Upvotes: 1
Reputation: 57764
One obvious way to optimize the code is to pull the conditional outside the loop:
if (s)
    for (i = 0; i < a_really_big_int; ++i) {
        r[i] = x ^ i;
    }
else
    for (i = 0; i < a_really_big_int; ++i) {
        r[i] = x ^ a[i];
    }
A shrewd compiler might be able to change that into r[] assignments of more than one element at a time.
Upvotes: 5