Reputation: 385
I'm using C and I have two non-negative integers n and m (both >= 0, n < 500). I need to compute the quantity
n*(n+1)/2 + m
and this will be needed hundreds of millions of times so I want to optimize this as much as possible. My current implementation is:
inline int func(const int n, const int m) { return ( (n*(n+1) >> 1) + m); }
using inline and >> 1 to do the divide by 2. Is there any other way to speed up this calculation?
Upvotes: 10
Views: 1865
Reputation: 249
You can use a recursive algorithm that multiplies two integers without actually using the multiplication operator, and with a minimal number of other arithmetic operations.
Note that the conventional way of multiplying two numbers has complexity O(M*N), but this approach multiplies in O(log N), where N is the smaller number.
There is another algorithm for multiplying two integers, the Karatsuba algorithm, but I don't think you will need it, as it is better suited to very large numbers.
Upvotes: -1
Reputation: 15144
In practice, what you want to do is write a loop that the compiler can easily and efficiently vectorize and parallelize. If you have two arrays n[i] and m[i], any modern compiler can probably figure out how to optimize n[i]*(n[i]+1)/2 + m[i] if given the proper flags. Trying to force the compiler to do optimizations on one word at a time will, in general, be counterproductive. Modern hardware is fastest when you parallelize your critical loops. If you don't want to use non-portable intrinsics or libraries designed for the purpose, you can best achieve that by minimizing data dependencies and writing code that is easy to statically analyze.
You might or might not be able to improve the generated code with (n*n + n)/2 + m, that is, writing the polynomial out in expanded form. This is efficient because it enables the code generator to use just one vector register as the accumulator, maximizing the number available for SIMD. You should use restrict and alignas as appropriate to enable maximum optimization.
(Edit: Right-shift of a negative number is implementation-defined, because it might be logical or arithmetic. The code I wrote does unsigned math, which lets the compiler optimize /2 to >>1 for you. In a comment, robthebloke brings up that, if you use signed rather than unsigned variables, and you know that they will always be non-negative, the compiler might not be able to statically deduce this and therefore might not optimize /2 to >>1. In that case, you can either write >>1 yourself or cast with (uint32_t)n[i] to do better-defined unsigned math. An unsafe-math optimization flag might also re-enable this.)
This kind of vectorization is likely to be faster than individual table lookups on each element.
The result will be in the range from 0 to 125,750, which is too large for an unsigned short, and therefore the smallest type that could hold it is int32_t or uint32_t. (Or uint_least32_t if you prefer.) Using an array of the smallest type allows maximum vectorization.
If you want to help the optimizer out, you can enable OpenMP and add a #pragma omp simd to explicitly tell the compiler to vectorize this loop. You could also use OpenMP to enable multi-threading.
In C++, you have the options of std::valarray<uint32_t> or expression templates, which are very elegant ways to express an embarrassingly-parallel computation such as this one.
The following program compiles to vectorized code on GCC, Clang or ICC when given the appropriate optimization flags. Clang compiles to a loop that computes 256 elements per iteration.
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)

typedef uint_least32_t elem_t;

const elem_t n[N];
const elem_t m[N];
elem_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }
    return EXIT_SUCCESS;
}
You can attempt to add alignas specifiers to the arrays, but this will not actually cause GCC, Clang or ICC to perform aligned loads or stores. (There is a GCC extension to enable this optimization.)
If you enable the OpenMP library (-fopenmp in GCC or Clang), you can add the line #pragma omp parallel for simd immediately before the for loop, or a more complex version, and get a loop that is both multithreaded and vectorized. If there's a way to significantly improve on that with standard, portable C, I'd love to know about it myself.
I wrote my MWE to be simple. In real-world code, you likely want to move the entire loop, of which this inner loop is part, out of main() and into a function such as
elem_t* func(const ptrdiff_t nelems,
             const elem_t n[nelems],
             const elem_t m[nelems],
             elem_t a[nelems])
{
    for (ptrdiff_t i = 0; i < nelems; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }
    return a;
}
If you compare the generated assembly, you will see that it is not as efficient unless you inline it, largely because the compiler no longer knows the number of iterations at compile-time or has any information about the alignment of n, m or a.
You could also save some memory, but probably not computation time, by storing the input elements as uint16_t. The input arrays use half as much memory, but the loop cannot operate on more elements at a time than before, because the computation still uses elements of the larger size. Be careful to cast the temporary values you use for the calculation to a type that will not overflow!
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)

const uint16_t n[N];
const uint16_t m[N];
uint32_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = ((uint32_t)n[i]*n[i] + n[i])/2 + m[i];
    }
    return EXIT_SUCCESS;
}
Upvotes: 5
Reputation: 4431
I think a better approach would be to ask if you really need to compute this so many times. For example if n is constant in the inner loop, could you calculate n*(n+1)/2 outside? (Though it is possible that an optimising compiler will do this anyway). Alternatively if you are incrementing n in the inner loop perhaps you could use
(n+1)*(n+2)/2 = n*(n+1)/2 + n + 1
to update n*(n+1)/2 instead of calculating it afresh each time.
Upvotes: 2
Reputation: 47962
You say "this will be needed hundreds of millions of times", as if that's a lot. But these days, hundreds of millions of times is nothing.
I just wrote an obvious little program to perform n*(n+1)/2 + m 100,000,000 times. I did absolutely nothing fancy to try to make it "efficient". On an ordinary consumer-grade laptop, it ran in about half a second -- which is too fast to even time it accurately. So then I tried it for 100 times as long: 10,000,000,000 times. In that case it took about 52 seconds, which works out to about 5.2 nanoseconds per computation. (And there was some overhead involved, so the actual time per computation was even less.)
Let's say you spend one hour trying to speed this function up. (You've probably spent almost that much time just posting your question to Stack Overflow and reading responses, not to mention the time we all spent replying.) Let's say you manage to speed it up by 50% (that is, make it twice as fast). Based on my result, you would have to run the function something like 1.4e12 times (that's over a trillion times) before you got your hour back.
So if you're going to be running this computation many trillions of times (not just hundreds of millions of times), then maybe (maybe!) spend some time trying to speed it up. Otherwise -- sorry to be a downer about this -- just don't bother.
See also this answer to a somewhat similar question.
(I'm not trying to suggest that efficiency is never important, but it's also important to keep your actual situation in perspective.)
Upvotes: 0
Reputation: 2718
In the end, the question is: can you really optimize more than the straightforward implementation you already have?
Here is a quick test using arm-none-eabi-gcc with -O2 optimisation level: see here
int func(int n, int m)
{
    return (n*(n+1) >> 1) + m;
}
compiles in:
func(int, int):
mla r3, r0, r0, r0
add r0, r1, r3, asr #1
bx lr
So two assembly instructions (excluding the bx lr, which will disappear with inlining). I don't see how you could write a quicker implementation.
EDIT: just for fun, if you compile at level -O0, here is what you get:
func(int, int):
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
str r0, [fp, #-8]
str r1, [fp, #-12]
ldr r3, [fp, #-8]
add r3, r3, #1
ldr r2, [fp, #-8]
mul r3, r2, r3
mov r2, r3, asr #1
ldr r3, [fp, #-12]
add r3, r2, r3
mov r0, r3
sub sp, fp, #0
ldr fp, [sp], #4
bx lr
GCC can be very clever; you only have to tell it to be ;)
Upvotes: 2
Reputation: 1215
You can use direct assembly instructions. In VC++ you can use the __asm keyword to start an assembly section; you can put such a section inside a regular function and call that function normally. For GCC-based compilers you can use asm().
Upvotes: 0
Reputation: 224207
Given that n will be less than 500, you can precompute all possible values of n*(n+1)/2 and put them in a table, then use that table to perform the computation:
int n_sum[500];

// call this once at program start
void init_sum(void)
{
    int i;
    for (i = 0; i < 500; i++) {
        n_sum[i] = i*(i+1)/2;
    }
}

inline int func(const int n, const int m)
{
    return n_sum[n] + m;
}
Upvotes: 8