Reputation: 850
I need to negate a very large number of doubles quickly. If bit_generator generates 0, the sign must be changed; if it generates 1, nothing happens. The loop runs many times over and bit_generator is extremely fast. On my platform case 2 is noticeably faster than case 1, so it looks like my CPU doesn't like branching. Is there any faster and portable way to do it? What do you think about case 3?
// generates 0 and 1
int bit_generator();
// big vector (C++)
vector<double> v;
// case 1
for (size_t i=0; i<v.size(); ++i)
    if (bit_generator()==0)
        v[i] = -v[i];
// case 2
const int sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign[bit_generator()];
// case 3
const double sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign[bit_generator()];
// case 4 uses C-array
double a[N];
double number_generator(); // generates doubles
double z[2]; // used as buffer
for (size_t i=0; i<N; ++i) {
    z[0] = number_generator();
    z[1] = -z[0];
    a[i] = z[bit_generator()];
}
EDIT: Added case 4 and the C tag, because the vector can be a plain array. Since I can control how the doubles are generated, I redesigned the code as shown in case 4. It avoids the extra multiplication and the branching at the same time. I presume it should be quite fast on all platforms.
Upvotes: 4
Views: 2049
Reputation: 43497
On my machine, running at 5333.24 BogoMIPS, the timings for 1'000 iterations across an array of 1'000'000 doubles yield the following times per expression:
p->d = -p->d 7.33 ns
p->MSB(d) ^= 0x80 6.94 ns
Where MSB(d) is pseudo-code for grabbing the most significant byte of d. This means that the obfuscated approach is about 5.32% faster than the naive d = -d. For a billion such negations this is the difference between 7.3 and 6.9 seconds.
Someone must have an awfully big pile of doubles to care about that optimization.
Incidentally, I had to print out the content of the array when completed or my compiler optimized the whole test into zero op codes.
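For anyone reproducing the measurement, a minimal harness in that spirit might look like the sketch below (the array size and iteration count are taken from the figures above; the checksum printed at the end is what stops the optimizer from deleting the loops):
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> v(1000000, 1.0);    // 1'000'000 doubles
    auto start = std::chrono::steady_clock::now();
    for (int iter = 0; iter < 1000; ++iter) // 1'000 iterations over the array
        for (double &x : v)
            x = -x;                         // the expression under test
    auto stop = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (double x : v)
        sum += x;
    std::chrono::duration<double> elapsed = stop - start;
    // printing the checksum forces the compiler to keep the work above
    std::printf("sum=%f, %.2f ns per negation\n",
                sum, 1e9 * elapsed.count() / (1000.0 * v.size()));
}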
Upvotes: -1
Reputation: 28747
Unless you want to resize the vector in the loop, lift the v.size() out of the for expression, i.e.
const size_t SZ=v.size();
for (size_t i=0; i<SZ; ++i)
    if (bit_generator()==0)
        v[i] = -v[i];
If the compiler can't see what happens in bit_generator(), then it might be very hard for the compiler to prove that v.size() does not change, which makes loop unrolling or vectorization impossible.
UPDATE: I've made some tests and on my machine method 2 seems to be the fastest of the ones shown. However, it seems to be even faster to use a pattern which I call "group action" :-). Basically, you group multiple decisions into one value and switch over it:
const size_t SZ=v.size();
for (size_t i=0; i<SZ; i+=2) // manual loop unrolling
{
    // pack two bits into one value in 0..3; remember that per the
    // question a 0 from bit_generator() means "negate"
    int val=2*bit_generator()+bit_generator();
    switch(val) // only one conditional
    {
    case 0: // both bits 0: negate both elements
        v[i]=-v[i];
        v[i+1]=-v[i+1];
        break;
    case 1: // high bit 0: negate v[i]
        v[i]=-v[i];
        break;
    case 2: // low bit 0: negate v[i+1]
        v[i+1]=-v[i+1];
        break;
    case 3: // both bits 1: nothing happens
        break;
    }
}
// not shown: wrap up the loop if SZ%2==1
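For completeness, the wrap-up for an odd-sized vector could be as simple as this (sticking to the question's convention that 0 means negate):
if (SZ%2==1 && bit_generator()==0)
    v[SZ-1] = -v[SZ-1];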
Upvotes: 9
Reputation: 127458
If you can assume that the sign is represented by one specific bit, as in x86 (IEEE 754) implementations, you can simply do:
v[i] ^= (unsigned long long)!bit_generator() << SIGN_BIT_POSITION;
// negate the output of bit_generator() because 0 means
// "negate" and 1 means "leave unchanged"; the cast keeps
// the shift in a 64-bit type
In x86 the sign bit is the MSB, so for doubles that's bit 63:
#define SIGN_BIT_POSITION 63
will do the trick.
Edit:
Based on comments, I should add that you might need to do some extra work to get this to compile, since v is an array of double, while bit_generator() returns int. You could do it like this:
union int_double {
double d; // assumption: double is 64 bits wide
unsigned long long i; // assumption: unsigned long long is 64 bits wide
};
(syntax might be a bit different for C because you might need a typedef.)
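For C, that typedef could look like this (same 64-bit assumptions as above):
typedef union {
    double d;             /* assumption: double is 64 bits wide */
    unsigned long long i; /* assumption: unsigned long long is 64 bits wide */
} int_double;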
Then define v as a vector of int_double and use:
v[i].i ^= (unsigned long long)!bit_generator() << SIGN_BIT_POSITION; // the ! keeps the "0 means negate" convention
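If you'd rather not read the inactive member of a union (fine in C, but formally undefined behaviour in C++), the same bit flip can be expressed with std::memcpy, which mainstream compilers typically optimize to the same code; a minimal sketch, with flip_sign_if being a helper name I've made up:
#include <cstdint>
#include <cstring>

// flips the sign of x when flip is 1, leaves it alone when flip is 0
inline double flip_sign_if(double x, int flip)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits); // well-defined way to get the bit pattern
    bits ^= (std::uint64_t)flip << 63;   // bit 63 is the IEEE 754 sign bit
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
// usage, keeping the question's "0 means negate" convention:
//   v[i] = flip_sign_if(v[i], !bit_generator());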
Upvotes: 5
Reputation: 127458
Assuming that the actual negation is fast (a good assumption on a modern compiler and CPU), you could use a conditional assignment, which is also fast on modern CPUs, to choose between two possibilities:
v[i] = bit_generator() ? v[i] : -v[i];
This avoids branches and allows the compiler to vectorize the loop and make it faster.
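Spelled out as a full loop (the same idea, with the size hoisted out of the loop condition):
const size_t size = v.size();
for (size_t i=0; i<size; ++i)
    v[i] = bit_generator() ? v[i] : -v[i]; // 1: keep the value, 0: negate it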
Upvotes: 1
Reputation: 96241
Are you able to rewrite bit_generator so it returns 1 and -1 instead? That removes an indirection from the equation at the possible cost of some clarity.
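If that is possible, case 2/3 collapses to a single multiplication per element; a sketch, with sign_generator() as a made-up name for that variant:
// hypothetical variant of bit_generator() that returns -1 (negate) or 1 (keep)
int sign_generator();

const size_t size = v.size();
for (size_t i=0; i<size; ++i)
    v[i] *= sign_generator(); // no lookup table, no branch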
Upvotes: 0
Reputation: 263128
You don't need the lookup table; a simple formula suffices:
const size_t size = v.size();
for (size_t i=0; i<size; ++i)
    v[i] *= 2*bit_generator() - 1;
Upvotes: 2
Reputation: 42805
Generally, if you have an if() inside a loop, that loop cannot be vectorized or unrolled, and the code has to execute once per pass, maximizing the loop overhead. Case 3 should perform very well, especially if the compiler can use SSE instructions.
For fun, if you're using GCC, use the -S -o foo.S -c foo.c flags instead of the usual -o foo.o -c foo.c flags. This will give you the assembly code, and you can see what is getting compiled for your three cases.
Upvotes: 3