Reputation: 13575
I have code
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;
void foo(int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
int main()
{
    int m = 1001001;
    vector<double> a(m), b(m), c(m), d(m), f(m);
    clock_t start = std::clock();
    for (int i = 0; i < 1000; ++i)
        foo(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000]);
    double duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Finished in " << duration << " seconds [CPU Clock] " << endl;
}
Can you give me a workable example of how to optimize this for better performance? Any compiler is fine, e.g. the Intel C++ compiler or the Visual C++ compiler. Please also suggest a CPU that performs well on this kind of job.
Upvotes: 1
Views: 144
Reputation: 3264
You could experiment with prefetching the vectors into cache and then operating on them in chunks of 8 (eight doubles fill a typical 64-byte cache line).
Make sure that while you are operating on x[i] to x[i+7] you are prefetching x[i+8] to x[i+15].
This might not help, though: additions and multiplications are so fast that your RAM may not be able to keep up anyway.
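A minimal sketch of that idea, using the GCC/Clang builtin __builtin_prefetch (the function name foo_prefetch is illustrative, not from the question): while the current chunk of 8 doubles is processed, the next cache line of each array is requested.

```cpp
// Sketch: process 8 doubles (one 64-byte cache line per array) at a time,
// prefetching the next line of each array while the current one is in use.
// __builtin_prefetch is a GCC/Clang extension; it is only a hint.
void foo_prefetch(int n, double* a, const double* b, const double* c,
                  const double* d, const double* e, const double* f,
                  const double* g)
{
    for (int i = 0; i < n; i += 8) {
        if (i + 16 <= n) {           // only prefetch lines that exist
            __builtin_prefetch(&a[i + 8]);
            __builtin_prefetch(&b[i + 8]);
            __builtin_prefetch(&c[i + 8]);
            __builtin_prefetch(&d[i + 8]);
        }
        const int end = (i + 8 < n) ? i + 8 : n;
        for (int j = i; j < end; ++j)
            a[j] = b[j] * a[j] + c[j] * (d[j] + e[j] + f[j] + g[j]);
    }
}
```

Note that modern CPUs already detect sequential access patterns and stream them in hardware, so explicit prefetching may show no gain here.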
Upvotes: 1
Reputation: 52538
The code in question is useless: it does lots of calculations with uninitialised variables and then ignores the results. Compilers are getting more and more clever at spotting that kind of thing and removing all of the code, so don't be surprised if code like this takes no time at all.
In C, you would declare the pointers as const double* restrict (except a, which would be double* restrict), telling the compiler that all pointers except the first point to data that isn't going to be modified during the loop; this allows the compiler to vectorise. Unfortunately, restrict is not a standard C++ feature, as far as I know.
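Most C++ compilers do offer a non-standard equivalent, though; a sketch using the GCC/Clang spelling __restrict__ (MSVC uses __restrict):

```cpp
// Non-standard but widely supported: promise the compiler that, inside foo,
// each array is reached only through its own pointer, enabling vectorisation.
// (GCC/Clang spelling shown; MSVC spells it __restrict.)
void foo(int n, double* __restrict__ a,
         const double* __restrict__ b, const double* __restrict__ c,
         const double* __restrict__ d, const double* __restrict__ e,
         const double* __restrict__ f, const double* __restrict__ g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
```

The question's actual call passes overlapping read-only pointers (e points into d, g points into f); that is still fine here, since only a is written.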
If this was your real problem, you would just swap the inner and outer loop, and remove loop invariants like this:
void foo(int iter, int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i) {
        double xa = a[i];
        double xb = b[i];
        double xr = c[i] * (d[i] + e[i] + f[i] + g[i]);
        for (int j = 0; j < iter; ++j)
            xa = xb * xa + xr;
        a[i] = xa;
    }
}
You'd probably process four elements in parallel to hide the multiply-add latency.
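The recurrence xa = xb * xa + xr is serial in j, so "four in parallel" means carrying four elements' independent recurrences at once; a sketch (foo4 is an illustrative name):

```cpp
// Four independent dependency chains per j step hide the multiply-add latency,
// since the CPU can overlap them while each chain waits on its own result.
void foo4(int iter, int n, double* a, const double* b, const double* c,
          const double* d, const double* e, const double* f, const double* g)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        double x0 = a[i],     x1 = a[i + 1], x2 = a[i + 2], x3 = a[i + 3];
        double r0 = c[i]     * (d[i]     + e[i]     + f[i]     + g[i]);
        double r1 = c[i + 1] * (d[i + 1] + e[i + 1] + f[i + 1] + g[i + 1]);
        double r2 = c[i + 2] * (d[i + 2] + e[i + 2] + f[i + 2] + g[i + 2]);
        double r3 = c[i + 3] * (d[i + 3] + e[i + 3] + f[i + 3] + g[i + 3]);
        for (int j = 0; j < iter; ++j) {
            x0 = b[i]     * x0 + r0;   // four independent chains per step
            x1 = b[i + 1] * x1 + r1;
            x2 = b[i + 2] * x2 + r2;
            x3 = b[i + 3] * x3 + r3;
        }
        a[i] = x0; a[i + 1] = x1; a[i + 2] = x2; a[i + 3] = x3;
    }
    for (; i < n; ++i) {               // remainder elements, one at a time
        double x = a[i], r = c[i] * (d[i] + e[i] + f[i] + g[i]);
        for (int j = 0; j < iter; ++j)
            x = b[i] * x + r;
        a[i] = x;
    }
}
```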
But in a real life situation, you would observe that in each call, you read about 40MB which is way beyond any cache. So you are limited by RAM speed. The usual solution is to split the work into smaller parts, for example 500 elements at a time, so everything fits into L1 cache, then perform the operation with the same data 1000 times.
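The splitting described above might look like this (foo_blocked and the exact chunk size are illustrative; tune the chunk for your L1 cache):

```cpp
// Cache blocking: run all `iter` passes over one small window before moving
// on, so the window stays resident in L1 instead of streaming from RAM.
void foo_blocked(int iter, int n, double* a, const double* b, const double* c,
                 const double* d, const double* e, const double* f,
                 const double* g)
{
    const int chunk = 500;  // 500 doubles * 8 B * 7 arrays = 28 KB, fits a 32 KB L1
    for (int lo = 0; lo < n; lo += chunk) {
        const int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int j = 0; j < iter; ++j)       // all passes over one hot window
            for (int i = lo; i < hi; ++i)
                a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
    }
}
```

This reordering gives bit-identical results here because only a is written and a does not overlap the other arrays.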
Upvotes: 2
Reputation: 69882
On Apple clang, I tried:
__restrict__
on the arguments, to convince the compiler that there was no aliasing. Result: no change.
foo()
result: computation time increased from ~3 seconds to ~18 seconds!
#pragma omp parallel for
result: compiler ignored me and stayed with the original solution. ~3 seconds.
-march=native
to allow the CPU's full awesomeness to shine. Result: different assembler output (vectorisation applied), but run time still unchanged at ~3 s.
Initial conclusion:
This problem is bound by memory access, not by the CPU.
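A back-of-envelope estimate supports this. Roughly seven double streams are touched per element (a is read and written; e and g overlap d and f, which reduces real traffic somewhat), so each call moves about 7 × 8 B × 1e6 ≈ 56 MB, and the 1000 calls ≈ 56 GB. The ~20 GB/s RAM bandwidth below is an assumed figure:

```cpp
// Back-of-envelope memory-traffic estimate (all inputs are assumptions):
// streams of 8-byte doubles per element, elements per call, number of calls.
double estimated_seconds(double streams, double elems, double calls,
                         double bytes_per_sec)
{
    return streams * 8.0 * elems * calls / bytes_per_sec;
}
// estimated_seconds(7, 1e6, 1000, 20e9) gives 2.8 s, close to the ~3 s observed.
```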
Upvotes: 1
Reputation: 17
I think you should use multithreading. Change foo to take fromIndex and toIndex instead of n, and distribute the vectors over threads.
void foo(int fromIndex, int toIndex, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = fromIndex; i < toIndex; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
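The distribution step might be sketched with std::thread like this (foo_parallel is an illustrative name; each thread writes a disjoint slice of a, so no locking is needed):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// The range-based foo from above:
void foo(int fromIndex, int toIndex, double* a, double* b, double* c,
         double* d, double* e, double* f, double* g)
{
    for (int i = fromIndex; i < toIndex; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}

// Split [0, n) evenly across the available hardware threads.
void foo_parallel(int n, double* a, double* b, double* c, double* d,
                  double* e, double* f, double* g)
{
    const unsigned nt = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nt; ++t) {
        const int lo = static_cast<int>(static_cast<long long>(n) * t / nt);
        const int hi = static_cast<int>(static_cast<long long>(n) * (t + 1) / nt);
        threads.emplace_back(foo, lo, hi, a, b, c, d, e, f, g);
    }
    for (std::thread& th : threads)
        th.join();
}
```

Since the loop does so little arithmetic per byte, extra threads mainly compete for the same memory bandwidth, so the speed-up may be modest.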
Upvotes: 0