Lightsout

Reputation: 3757

How to optimize this math operation for speed

I'm trying to optimize a function that takes a good chunk of execution time and computes the following math operation many times. Is there any way to make this operation faster?

float total = sqrt((point_A[j].length * point_A[j].length) +
                   (point_B[j].width  * point_B[j].width)  +
                   (point_C[j].height * point_C[j].height));

Upvotes: 1

Views: 473

Answers (4)

Toby Speight

Reputation: 30879

Your question could be improved by adding a little more context. Is your code required to be portable, or are you targeting a particular compiler, or a specific processor or processor family? Perhaps you're willing to accept a general baseline version with target-specific optimised versions selected at runtime?

Also, there's very little context for the line of code you give. Is it in a tight loop? Or is it scattered in a bunch of places in conditional code in such a loop?

I'm going to assume that it's in a tight loop thus:

for (int j=0;  j<total;  ++j)
    length[j] = sqrt(
      (point_A[j].length)*(point_A[j].length) +
      (point_B[j].width)*(point_B[j].width) +
      (point_C[j].height)*(point_C[j].height));

I'm also going to assume that your target processor is multi-core and that the arrays are distinct (or that the relevant elements are distinct). In that case, an easy win is to annotate for OpenMP:

#pragma omp parallel for
for (int j=0;  j<total;  ++j)
    length[j] = sqrt((point_A[j].length)*(point_A[j].length) +
                     (point_B[j].width)*(point_B[j].width) +
                     (point_C[j].height)*(point_C[j].height));

Compile with g++ -O3 -fopenmp -march=native (or substitute native with your desired target processor architecture).

If you know your target, you might also be able to benefit from parallelisation of loops with the GCC flag -ftree-parallelize-loops=n; see the manual.

Now measure your performance change (I'm assuming that you measured the original, given that this is an optimisation question). If it's still not fast enough for you, then it's time to consider changing your data structures, algorithms, or individual lines of code.
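
If you don't already have a timing harness, a minimal sketch looks like this (std::chrono, C++11; compute_lengths() is a hypothetical stand-in for whichever version of the loop you're testing):

#include <chrono>
#include <cstdio>

void compute_lengths();  // placeholder for the loop under test

void benchmark()
{
    auto start = std::chrono::steady_clock::now();
    compute_lengths();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    std::printf("took %.3f ms\n", elapsed.count());
}

Run it several times and compare medians rather than single runs; one-shot measurements are noisy.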

Upvotes: 1

user555045

Reputation: 64904

Rearrange the data into this:

float *pointA_length;
float *pointB_width;
float *pointC_height;

That may require some level of butchering of your data structures, so you'll have to choose whether it's worth it or not.
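
For illustration, a one-time repack from the question's apparent layout into the flat arrays might look like this (the Point struct and its fields are assumptions based on the question's snippet):

struct Point { float length, width, height; };  // hypothetical, matching point_A[j].length etc.

void repack(const Point* point_A, const Point* point_B, const Point* point_C,
            float* Alengths, float* Bwidths, float* Cheights, int n)
{
    for (int j = 0; j < n; j++) {
        Alengths[j] = point_A[j].length;
        Bwidths[j]  = point_B[j].width;
        Cheights[j] = point_C[j].height;
    }
}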

Now what we can do is write this:

#include <math.h>

void process_points(float* Alengths, float* Bwidths, float* Cheights,
                    float* output, int n)
{
    for (int i = 0; i < n; i++) {
        output[i] = sqrtf(Alengths[i] * Alengths[i] +
                          Bwidths[i]  * Bwidths[i]  +
                          Cheights[i] * Cheights[i]);
    }
}

Writing it like this allows it to be auto-vectorized. For example, GCC targeting AVX with -fno-math-errno -ftree-vectorize can vectorize that loop, though it does so with a lot of cruft. __restrict__ and alignment attributes only improve that a little; as a rough sketch, they would be applied like this (they are promises to the compiler, so only add them if the pointers really don't alias and really are 32-byte aligned):
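
void process_points(float* __restrict__ Alengths,
                    float* __restrict__ Bwidths,
                    float* __restrict__ Cheights,
                    float* __restrict__ output, int n)
{
    // __restrict__ and __builtin_assume_aligned are GCC/Clang extensions;
    // they are unchecked promises, so they must actually hold.
    Alengths = (float*)__builtin_assume_aligned(Alengths, 32);
    Bwidths  = (float*)__builtin_assume_aligned(Bwidths, 32);
    Cheights = (float*)__builtin_assume_aligned(Cheights, 32);
    output   = (float*)__builtin_assume_aligned(output, 32);
    for (int i = 0; i < n; i++) {
        output[i] = sqrtf(Alengths[i] * Alengths[i] +
                          Bwidths[i]  * Bwidths[i]  +
                          Cheights[i] * Cheights[i]);
    }
}

So here's a hand-vectorized version as well (not tested):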

#include <immintrin.h>

void process_points(float* Alengths,
                    float* Bwidths,
                    float* Cheights,
                    float* output, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_load_ps(Alengths + i);
        __m256 b = _mm256_load_ps(Bwidths + i);
        __m256 c = _mm256_load_ps(Cheights + i);
        // sum = a*a + b*b + c*c, using fused multiply-adds
        __m256 asq = _mm256_mul_ps(a, a);
        __m256 sum = _mm256_fmadd_ps(c, c, _mm256_fmadd_ps(b, b, asq));
        __m256 hsum = _mm256_mul_ps(sum, _mm256_set1_ps(0.5f));
        // approximate 1/sqrt(sum), then one Newton-Raphson step:
        // y = y * (1.5 - 0.5 * x * y * y)
        __m256 invsqrt = _mm256_rsqrt_ps(sum);
        __m256 s = _mm256_mul_ps(invsqrt, invsqrt);
        invsqrt = _mm256_mul_ps(invsqrt, _mm256_fnmadd_ps(hsum, s, _mm256_set1_ps(1.5f)));
        // x * (1/sqrt(x)) = sqrt(x)
        _mm256_store_ps(output + i, _mm256_mul_ps(sum, invsqrt));
    }
}

This makes a number of assumptions:

  • all pointers are 32-byte aligned.
  • n is a multiple of 8, or at least the buffers have enough padding that they're never accessed out of bounds.
  • the input buffers are not aliased with the output buffer (they could be aliased among themselves, but why would they be?).
  • the slightly reduced accuracy of the square root computed this way is OK (accurate to approximately 22 bits, instead of correctly rounded).
  • the sum of squares computed with fmadd can differ slightly from one computed with separate multiplies and adds; I assume that's OK too.
  • no sum of squares is zero: _mm256_rsqrt_ps(0) is infinite, and the refinement step then turns that into NaN.
  • your target supports AVX and FMA, so this will actually run.

The method used here for computing the square root is an approximate reciprocal square root, one Newton-Raphson improvement step (y = y * (1.5 - 0.5 * x * y * y)), and then a multiplication by x, because x * 1/sqrt(x) = x/sqrt(x) = sqrt(x).
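
In scalar form, the same computation looks like this (a sketch for clarity only; it assumes x > 0 and that y0 is the initial 1/sqrt(x) estimate):

float refined_sqrt(float x, float y0)
{
    float y = y0 * (1.5f - 0.5f * x * y0 * y0);  // one Newton-Raphson step on 1/sqrt(x)
    return x * y;                                // x * (1/sqrt(x)) = sqrt(x)
}

Each step roughly doubles the number of correct bits, which is how the roughly 12-bit rsqrt estimate reaches the approximately 22 bits mentioned above.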

Upvotes: 2

gugu69

Reputation: 156

You could also try to optimize the sqrt function itself. May I suggest that you have a look at this link: Best Square Root Method
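
As a cheap first step along those lines: when the operands are float, sqrtf avoids the implicit float-to-double round trip that plain sqrt performs, and with -fno-math-errno the compiler can usually inline it as a single hardware instruction. Applied to the question's expression:

// sqrt() promotes float operands to double and converts back;
// sqrtf() stays in single precision throughout.
float total = sqrtf(point_A[j].length * point_A[j].length +
                    point_B[j].width  * point_B[j].width  +
                    point_C[j].height * point_C[j].height);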

Upvotes: 1

Neeraj

Reputation: 51

If memory is cheap, you could do the following to improve the CPU cache hit rate. Since you haven't posted more details, I will make some assumptions here.

float tmp_len_square[N*3];

for (int j = 0; j < N; ++j) {
    tmp_len_square[3 * j] = (point_A[j].length)*(point_A[j].length);
}

for (int j = 0; j < N; ++j) {
    tmp_len_square[(3 * j) + 1] = (point_B[j].width)*(point_B[j].width);
}

for (int j = 0; j < N; ++j) {
    tmp_len_square[(3 * j) + 2] = (point_C[j].height)*(point_C[j].height);
}

for (int j = 0; j < N; ++j) {
    float total = sqrt(tmp_len_square[3 * j] +
                       tmp_len_square[(3 * j) + 1] +
                       tmp_len_square[(3 * j) + 2]);
    // ...
}

Upvotes: 2
