user401947

Reputation: 850

C/C++ optimization: negate doubles fast

I need to negate a very large number of doubles quickly. If bit_generator produces 0, the sign must be changed; if it produces 1, nothing happens. The loop is run many times and bit_generator is extremely fast. On my platform, case 2 is noticeably faster than case 1, so it looks like my CPU doesn't like branching. Is there a faster, portable way to do this? What do you think about case 3?

// generates 0 and 1
int bit_generator();

// big vector (C++)
vector<double> v;

// case 1
for (size_t i=0; i<v.size(); ++i)
    if (bit_generator()==0)
        v[i] = -v[i];

// case 2
const int sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign[bit_generator()];

// case 3
const double sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign[bit_generator()];

// case 4 uses C-array
double a[N];
double number_generator(); // generates doubles
double z[2]; // used as buffer
for (size_t i=0; i<N; ++i) {
    z[0] = number_generator();
    z[1] = -z[0];
    a[i] = z[bit_generator()];
}

EDIT: Added case 4 and the C tag, because the vector can be a plain array. Since I can control how the doubles are generated, I redesigned the code as shown in case 4. It avoids the extra multiplication and the branching at the same time. I presume it should be quite fast on all platforms.

Upvotes: 4

Views: 2049

Answers (7)

msw

Reputation: 43497

Premature optimization is the root of insipid SO questions

On my machine, running at 5333.24 BogoMIPS, the timings for 1'000 iterations across an array of 1'000'000 doubles yield the following times per expression:

p->d = -p->d         7.33 ns
p->MSB(d) ^= 0x80    6.94 ns

Here MSB(d) is pseudo-code for grabbing the most significant byte of d. The naive d = -d thus takes 5.32% longer to execute than the obfuscated approach; for a billion such negations, that is the difference between 7.3 and 6.9 seconds.
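For reference, a minimal sketch of what the MSB(d) ^= 0x80 idiom can look like; the byte index assumes IEEE 754 doubles on a little-endian machine, so it is decidedly not portable:

/* Flip the sign of *d by XORing the byte that holds the sign bit.
   Assumes IEEE 754 doubles and little-endian layout (sign bit in the last byte). */
static inline void flip_sign_byte(double *d)
{
    unsigned char *bytes = (unsigned char *)d;
    bytes[sizeof(double) - 1] ^= 0x80u;
}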

Someone must have an awfully big pile of doubles to care about that optimization.

Incidentally, I had to print out the content of the array when completed or my compiler optimized the whole test into zero op codes.

Upvotes: -1

Nordic Mainframe

Reputation: 28747

Unless you want to resize the vector in the loop, lift the v.size() out of the for expression, i.e.

const size_t SZ=v.size();
for (size_t i=0; i<SZ; ++i)
    if (bit_generator()==0)
        v[i] = -v[i];

If the compiler can't see what happens in bit_generator(), then it might be very hard for the compiler to prove that v.size() does not change, which makes loop unrolling or vectorization impossible.

UPDATE: I've made some tests, and on my machine method 2 seems to be the fastest. However, it is even faster to use a pattern which I call "group action" :-). Basically, you group several decisions into one value and switch over it:

const size_t SZ=v.size();
for (size_t i=0; i<SZ; i+=2) // manual loop unrolling
{
 // combine two bits into one value; per the question, 0 means "negate"
 int val=2*bit_generator()+bit_generator();
 switch(val) // only one conditional
 {
  case 0:
     v[i]=-v[i];
     v[i+1]=-v[i+1];
     break;
  case 1:
     v[i]=-v[i];
     break;
  case 2:
     v[i+1]=-v[i+1];
     break;
  case 3:
     break; // nothing happens
 }
}
// not shown: wrap up the loop if SZ%2==1
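The element left over when SZ%2==1 could be wrapped up with a final check along these lines (a sketch, again treating 0 as "negate"):

if (SZ%2==1 && bit_generator()==0)
    v[SZ-1] = -v[SZ-1];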

Upvotes: 9

Nathan Fellman

Reputation: 127458

If you can assume that the sign is represented by one specific bit, as in x86 implementations, you can simply do:

v[i] ^= !bit_generator() << SIGN_BIT_POSITION; // negate the output of
                                               // bit_generator because 0 means 
                                               // negate and one means leave 
                                               // unchanged.

In x86 the sign bit is the MSB, so for doubles that's bit 63:

#define SIGN_BIT_POSITION 63 

will do the trick.

Edit:

Based on comments, I should add that you might need to do some extra work to get this to compile, since v holds doubles and the bitwise XOR operator isn't defined for floating-point operands. You could do it like this:

union int_double {
    double d;        // assumption: double is 64 bits wide
    long long int i; // assumption: long long is 64 bits wide
};

(syntax might be a bit different for C because you might need a typedef.)

Then define v as a vector of int_double and use:

v[i].i ^= (long long)!bit_generator() << SIGN_BIT_POSITION; // cast so the shift happens in 64 bits
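Putting those pieces together, a self-contained sketch might look like this; negate_some is just an illustrative name, the integer member is made unsigned so the shift by 63 is well defined, and the whole trick still assumes 64-bit IEEE 754 doubles:

#include <cstddef>
#include <cstdint>
#include <vector>

#define SIGN_BIT_POSITION 63

int bit_generator(); // 0 means "negate", 1 means "leave unchanged"

union int_double {
    double d;        // assumption: double is 64 bits wide
    std::uint64_t u; // unsigned so the shift below is well defined
};

void negate_some(std::vector<int_double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        // XOR the sign bit exactly when bit_generator() returns 0
        v[i].u ^= (std::uint64_t)!bit_generator() << SIGN_BIT_POSITION;
}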

Upvotes: 5

Nathan Fellman

Reputation: 127458

Assuming that the actual negation is fast (a good assumption on a modern compiler and CPU), you could use a conditional assignment, which is also fast on modern CPUs, to choose between the two possibilities:

v[i] = bit_generator() ? v[i] : -v[i];

This avoids branches and lets the compiler vectorize the loop, making it faster.
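In the context of the question's loop, that would look something like this (a sketch; negate_selected is just an illustrative name):

#include <cstddef>
#include <vector>

int bit_generator(); // 0 means "negate", 1 means "leave unchanged"

void negate_selected(std::vector<double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = bit_generator() ? v[i] : -v[i]; // select instead of branching
}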

Upvotes: 1

Mark B

Reputation: 96241

Are you able to rewrite bit_generator so it returns 1 and -1 instead? That removes an indirection from the equation at the possible cost of some clarity.
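If bit_generator could be changed that way (shown here as a hypothetical sign_generator), the loop collapses to a plain multiply:

int sign_generator(); // hypothetical replacement: returns -1 ("negate") or +1 ("keep")

for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign_generator();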

Upvotes: 0

fredoverflow

Reputation: 263128

You don't need the lookup table; a simple formula suffices:

const size_t size = v.size();
for (size_t i=0; i<size; ++i)
    v[i] *= 2*bit_generator() - 1;

Upvotes: 2

Mike DeSimone

Reputation: 42805

Generally, if you have an if() inside a loop, that loop can be neither vectorized nor unrolled, so each element has to be handled by its own iteration and the loop overhead is maximized. Case 3 should perform very well, especially if the compiler can use SSE instructions.

For fun, if you're using GCC, use the -S -o foo.S -c foo.c flags instead of the usual -o foo.o -c foo.c flags. This will give you the assembly code, and you can see what is getting compiled for each of your cases.

Upvotes: 3
