Reputation: 13575
I have code
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;
void foo(int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
int main()
{
    int m = 1001001;
    vector<double> a(m), b(m), c(m), d(m), f(m);
    clock_t start = std::clock();
    for (int i = 0; i < 1000; ++i)
        foo(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000]);
    double duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Finished in " << duration << " seconds [CPU Clock] " << endl;
}
Can you give me a workable example of how to optimize this for better performance? Any compiler is fine, e.g. the Intel C++ compiler or the Visual C++ compiler. Please also suggest a CPU that performs well on this kind of job.
Upvotes: 1
Views: 144
Reputation: 3264
You could experiment with prefetching the vectors into cache and then operating on them in chunks of 8 (eight doubles fill a typical 64-byte cache line).
Make sure that while you are operating on x[i] to x[i+7] you are prefetching x[i+8] to x[i+15].
This might not help, though: additions and multiplications are so fast that your RAM may not be able to keep up anyway.
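A minimal sketch of that idea, using the GCC/Clang builtin __builtin_prefetch (the function name foo_prefetch is illustrative, not from the question): while the current chunk of 8 doubles is processed, the next cache line of each array is requested.

```cpp
// Sketch: process 8 doubles (one 64-byte cache line per array) at a time,
// prefetching the next line of each array while the current one is in use.
// __builtin_prefetch is a GCC/Clang extension; it is only a hint.
void foo_prefetch(int n, double* a, const double* b, const double* c,
                  const double* d, const double* e, const double* f,
                  const double* g)
{
    for (int i = 0; i < n; i += 8) {
        if (i + 16 <= n) {           // only prefetch lines that exist
            __builtin_prefetch(&a[i + 8]);
            __builtin_prefetch(&b[i + 8]);
            __builtin_prefetch(&c[i + 8]);
            __builtin_prefetch(&d[i + 8]);
        }
        const int end = (i + 8 < n) ? i + 8 : n;
        for (int j = i; j < end; ++j)
            a[j] = b[j] * a[j] + c[j] * (d[j] + e[j] + f[j] + g[j]);
    }
}
```

Note that modern CPUs already detect sequential access patterns and stream them in hardware, so explicit prefetching may show no gain here.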
Upvotes: 1
Reputation: 52538
The code in question is useless: it does lots of calculations with uninitialised variables and then ignores the results. Compilers are getting more and more clever at spotting that kind of thing and removing all of the code, so don't be surprised if code like this takes no time at all.
In C, you would declare the pointers as const double* restrict (except a, which would be double* restrict), telling the compiler that all pointers except the first point to data that isn't going to be modified during the loop; this allows the compiler to vectorise. Unfortunately, restrict is not a standard C++ feature, as far as I know.
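Most C++ compilers do offer a non-standard equivalent, though; a sketch using the GCC/Clang spelling __restrict__ (MSVC uses __restrict):

```cpp
// Non-standard but widely supported: promise the compiler that, inside foo,
// each array is reached only through its own pointer, enabling vectorisation.
// (GCC/Clang spelling shown; MSVC spells it __restrict.)
void foo(int n, double* __restrict__ a,
         const double* __restrict__ b, const double* __restrict__ c,
         const double* __restrict__ d, const double* __restrict__ e,
         const double* __restrict__ f, const double* __restrict__ g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
```

The question's actual call passes overlapping read-only pointers (e points into d, g points into f); that is still fine here, since only a is written.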
If this was your real problem, you would just swap the inner and outer loop, and remove loop invariants like this:
void foo(int iter, int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i) {
        double xa = a[i];
        double xb = b[i];
        double xr = c[i] * (d[i] + e[i] + f[i] + g[i]);
        for (int j = 0; j < iter; ++j)
            xa = xb * xa + xr;
        a[i] = xa;
    }
}
You'd probably process four elements in parallel to hide the multiply-add latency.
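The recurrence xa = xb * xa + xr is serial in j, so "four in parallel" means carrying four elements' independent recurrences at once; a sketch (foo4 is an illustrative name):

```cpp
// Four independent dependency chains per j step hide the multiply-add latency,
// since the CPU can overlap them while each chain waits on its own result.
void foo4(int iter, int n, double* a, const double* b, const double* c,
          const double* d, const double* e, const double* f, const double* g)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        double x0 = a[i],     x1 = a[i + 1], x2 = a[i + 2], x3 = a[i + 3];
        double r0 = c[i]     * (d[i]     + e[i]     + f[i]     + g[i]);
        double r1 = c[i + 1] * (d[i + 1] + e[i + 1] + f[i + 1] + g[i + 1]);
        double r2 = c[i + 2] * (d[i + 2] + e[i + 2] + f[i + 2] + g[i + 2]);
        double r3 = c[i + 3] * (d[i + 3] + e[i + 3] + f[i + 3] + g[i + 3]);
        for (int j = 0; j < iter; ++j) {
            x0 = b[i]     * x0 + r0;   // four independent chains per step
            x1 = b[i + 1] * x1 + r1;
            x2 = b[i + 2] * x2 + r2;
            x3 = b[i + 3] * x3 + r3;
        }
        a[i] = x0; a[i + 1] = x1; a[i + 2] = x2; a[i + 3] = x3;
    }
    for (; i < n; ++i) {               // remainder elements, one at a time
        double x = a[i], r = c[i] * (d[i] + e[i] + f[i] + g[i]);
        for (int j = 0; j < iter; ++j)
            x = b[i] * x + r;
        a[i] = x;
    }
}
```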
But in a real life situation, you would observe that in each call, you read about 40MB which is way beyond any cache. So you are limited by RAM speed. The usual solution is to split the work into smaller parts, for example 500 elements at a time, so everything fits into L1 cache, then perform the operation with the same data 1000 times.
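The splitting described above might look like this (foo_blocked and the exact chunk size are illustrative; tune the chunk for your L1 cache):

```cpp
// Cache blocking: run all `iter` passes over one small window before moving
// on, so the window stays resident in L1 instead of streaming from RAM.
void foo_blocked(int iter, int n, double* a, const double* b, const double* c,
                 const double* d, const double* e, const double* f,
                 const double* g)
{
    const int chunk = 500;  // 500 doubles * 8 B * 7 arrays = 28 KB, fits a 32 KB L1
    for (int lo = 0; lo < n; lo += chunk) {
        const int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int j = 0; j < iter; ++j)       // all passes over one hot window
            for (int i = lo; i < hi; ++i)
                a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
    }
}
```

This reordering gives bit-identical results here because only a is written and a does not overlap the other arrays.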
Upvotes: 2
Reputation: 69882
On Apple clang, I tried:
__restrict__
on the arguments, to convince the compiler that there was no aliasing. Result: no change.
foo()
result: computation time increased from ~3 seconds to ~18 seconds!
#pragma omp parallel for
result: compiler ignored me and stayed with the original solution. ~3 seconds.
-march=native
to allow the CPU's full awesomeness to shine. Result: different assembler output (vectorisation applied), but run time still unchanged at ~3 s.
Initial conclusion:
This problem is bound by memory access, not by the CPU.
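A back-of-envelope estimate supports this. Roughly seven double streams are touched per element (a is read and written; e and g overlap d and f, which reduces real traffic somewhat), so each call moves about 7 × 8 B × 1e6 ≈ 56 MB, and the 1000 calls ≈ 56 GB. The ~20 GB/s RAM bandwidth below is an assumed figure:

```cpp
// Back-of-envelope memory-traffic estimate (all inputs are assumptions):
// streams of 8-byte doubles per element, elements per call, number of calls.
double estimated_seconds(double streams, double elems, double calls,
                         double bytes_per_sec)
{
    return streams * 8.0 * elems * calls / bytes_per_sec;
}
// estimated_seconds(7, 1e6, 1000, 20e9) gives 2.8 s, close to the ~3 s observed.
```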
Upvotes: 1
Reputation: 17
I think you should use multithreading. Change foo to take fromIndex and toIndex instead of n, and distribute the vectors over threads.
void foo(int fromIndex, int toIndex, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = fromIndex; i < toIndex; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
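The distribution step might be sketched with std::thread like this (foo_parallel is an illustrative name; each thread writes a disjoint slice of a, so no locking is needed):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// The range-based foo from above:
void foo(int fromIndex, int toIndex, double* a, double* b, double* c,
         double* d, double* e, double* f, double* g)
{
    for (int i = fromIndex; i < toIndex; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}

// Split [0, n) evenly across the available hardware threads.
void foo_parallel(int n, double* a, double* b, double* c, double* d,
                  double* e, double* f, double* g)
{
    const unsigned nt = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nt; ++t) {
        const int lo = static_cast<int>(static_cast<long long>(n) * t / nt);
        const int hi = static_cast<int>(static_cast<long long>(n) * (t + 1) / nt);
        threads.emplace_back(foo, lo, hi, a, b, c, d, e, f, g);
    }
    for (std::thread& th : threads)
        th.join();
}
```

Since the loop does so little arithmetic per byte, extra threads mainly compete for the same memory bandwidth, so the speed-up may be modest.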
Upvotes: 0