Reputation: 5440

Need help optimizing code (minimum image convention)

I have written some simulation code and am using the "randomly break in GDB" method of debugging. I am finding that 99.9% of my program's time is spent in this routine (it's the minimum image convention):

inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) {
        double xhw, yhw, zhw, x, y, z;                                                                                 

        xhw = boxWidth / 2.0;                                                                                      
        yhw = xhw;                                                                                      
        zhw = xhw;                                                                                      

        x = position2[0] - position1[0];                                                                               
        if (x > xhw) 
                x -= boxWidth;                                                                                     
        else if (x < -xhw)
                x += boxWidth;                                                                                     


        y = position2[1] - position1[1];                                                                               
        if (y > yhw) 
                y -= boxWidth;                                                                                     
        else if (y < -yhw)
                y += boxWidth;                                                                                     


        z = position2[2] - position1[2];                                                                               
        if (z > zhw) 
                z -= boxWidth;                                                                                     
        else if (z < -zhw)
                z += boxWidth;                                                                                     

        return x * x + y * y + z * z;                                                                                  
}

The optimizations I have performed so far (maybe not very significant ones):

Return the square of the distance instead of the square root
Inline it
Const what I can
No standard library bloat
Compiling with every g++ optimization flag I can think of

I am running out of things I can do with this. Maybe I could use floats instead of doubles but I would prefer that be a last resort. And maybe I could somehow use SIMD on this, but I've never done that so I imagine that's a lot of work. Any ideas?

Thanks

Upvotes: 1

Answers (4)

Robert T. McGibbon

Reputation: 5287

First, you're not using the right algorithm. What if the two points are greater than boxWidth apart? Second, if you have multiple particles, calling a single function that does all of the distance calculations and places the results in an output buffer is going to be significantly more efficient. Inlining helps reduce some of this, but not all. Any of the precalculation -- like dividing the box length by 2 in your algorithm -- is going to be repeated when it doesn't need to be.

Here is some SIMD code to do the calculation. You need to compile with -msse4. Using -O3, on my machine (macbook pro, llvm-gcc-4.2), I get a speed up of about 2x. This does require using 32bit floats instead of double precision arithmetic.

SSE really isn't that complicated, it just looks terrible. e.g. instead of writing a*b, you have to write the clunky _mm_mul_ps(a,b).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>

// you can compile this code with -DDOUBLE to try using doubles vs. floats
// in the unoptimized code. The SSE code uses only floats.
#ifdef DOUBLE
typedef double real;
#else
typedef float real;
#endif

static inline __m128 loadFloat3(const float const* value) {
  // Load (x,y,z) into a SSE register, leaving the last entry
  // set to zero.
  __m128 x = _mm_load_ss(&value[0]);
  __m128 y = _mm_load_ss(&value[1]);
  __m128 z = _mm_load_ss(&value[2]);
  __m128 xy = _mm_movelh_ps(x, y);
  return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0));
}

int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth,
                          float* out, const int n_points) {
  int i;
  __m128 r1, r2, r12, s12, r12_2, s, box, invBox;

  box = _mm_set1_ps(boxWidth);
  invBox = _mm_div_ps(_mm_set1_ps(1.0f), box);

  for (i = 0; i < n_points; i++) {
    r1 = loadFloat3(position1);
    r2 = loadFloat3(position1);
    r12 = _mm_sub_ps(r1, r2);

    s12 = _mm_mul_ps(r12, invBox);
    s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT));
    r12 = _mm_mul_ps(box, s12);

    r12_2 = _mm_mul_ps(r12, r12);

    // double horizontal add instruction accumulates the sum of
    // all four elements into each of the elements
    // (e.g. s.x = s.y = s.z = s.w =  r12_2.x + r12_2.y + r12_2.z + r12_2.w)
    s = _mm_hadd_ps(r12_2, r12_2);
    s = _mm_hadd_ps(s, s);

    _mm_store_ss(out++, s);
    position1 += 3;
    position2 += 3;
  }
  return 1;
}

inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) {
  real xhw, yhw, zhw, x, y, z;                                                                                 

  xhw = boxWidth / 2.0;                                                                                      
  yhw = xhw;                                                                                      
  zhw = xhw;                                                                                      

  x = position2[0] - position1[0];                                                                               
  if (x > xhw) 
    x -= boxWidth;                                                                                     
  else if (x < -xhw)
    x += boxWidth;                                                                                     


  y = position2[1] - position1[1];                                                                               
  if (y > yhw) 
    y -= boxWidth;                                                                                     
  else if (y < -yhw)
    y += boxWidth;                                                                                     


  z = position2[2] - position1[2];                                                                               
  if (z > zhw) 
    z -= boxWidth;                                                                                     
  else if (z < -zhw)
    z += boxWidth;                                                                                     

  return x * x + y * y + z * z;                                                                                  
}


int main(void) {
  real* position1;
  real* position2;
  real* output;
  int n_runs = 10000000;
  posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real));
  posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real));
  posix_memalign((void**) &output, 16, n_runs*sizeof(real));
  real boxWidth = 1.8;
  real result = 0;
  int i;
  clock_t t;

#ifdef OPT
  printf("Timing optimized SSE implementation\n");
#else
  printf("Timinig original implementation\n");
#endif

#ifdef DOUBLE
  printf("Using double precision\n");
#else
  printf("Using single precision\n");
#endif

  t = clock();
#ifdef OPT
  fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs);
#else
  for (i = 0; i < n_runs; i++) {
    *output = distanceSqPeriodic(position1, position2,  boxWidth);
    position1 += 3;
    position2 += 3;
    output++;
  }
#endif
  t = clock() - t;

  printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC);
}

Upvotes: 2

Aki Suihkonen

Reputation: 20037

You need to get rid of the comparisons, as those are hard to predict.

The function to be implemented is:

     /    /           /\  /\
    /    /           /  \/  \
    ----0-----  or ------------ , as (-x)^2 == x^2
       /    / 
      /    /

The latter is a result of two abs statements:

 x = abs(half-abs(diff))+half;

The code

double tst(double a[4], double b[4], double half)
{
   double sum=0.0,t;
   int i;
   for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;}
   return sum;
}

beats the original implementation by a factor of four (+some) -- and at this point there's not even full parallelism: only the lower half of xmm registers are used.

With parallel processing of x && y, there's a theoretical gain of about 50% to be achieved. Using floats instead of doubles could in theory make it still about 3x faster.

Upvotes: 0

tletnes

Reputation: 1998

you may want to use fabs (standarized in ISO 90 C) since this should be able to be reduced to a single non-branching instruction.

Upvotes: 1

Sebastian Mach

Reputation: 39099

Return the square of the distance instead of the square root

That's a good idea as long as you are comparing squares to squares.

Inline it

This is sometimes a counter-optimization: Inlined code takes up space in the execution pipeline/cache, whether it is branched to or not.

Often it makes no difference because the compiler has the final word on whether to inline or not.

Const what I can

Normally no difference at all.

No standard library bloat

What bloat?

Compiling with every g++ optimization flag I can think of

That's good: Leave most optimizations to the compiler. Only if you measured your real bottleneck, and determined if that bottleneck is significant, invest money on hand optimizing.

What you could try do is to make your code branchfree. Without using bitmasks, this may look like this:

//if (z > zhw) 
//        z -= boxWidths[2];
//else if (z < -zhw)
//      z += boxWidths[2]; 

const auto z_a[] = {
    z,
    z - boxWidths[2]
};
z = z_a[z>zhw];
...

z -= (z>zhw) * boxWidths[2];

However, there is no guarantee that this is faster. Your compiler may now have a harder time identifying SIMD spots in your code, or the branch target buffer does a good job and most of the times you have the same code paths through your function.

Upvotes: 0

Need help optimizing code (minimum image convention)

Answers (4)

Related Questions