Reputation: 59
I am currently trying to improve parallel performance in my code and I am still new to OpenMP. I have to iterate over a large container, in each iteration reading from multiple entries and writing a result to a single entry. Below is a very minimal code example of what I am trying to do.
data is a pointer to an array where a lot of data points are stored. Before the parallel region I create an array newData, so I can use data as read-only and newData as write-only; afterwards I throw the old data away and use newData for further calculations.
To my understanding, data and newData are shared between threads and everything declared inside the parallel region is private.
Can reading from data by multiple threads cause performance issues?
I am using #pragma omp critical for assigning a new value to an element of newData to avoid race conditions. Is this necessary, since I access every element of newData only once and never from multiple threads?
Also I am not sure about scheduling. Do I have to specify whether I want a static or dynamic schedule? Can I use nowait, since all threads are independent of each other?
array *newData = new array;
omp_set_num_threads(threads);

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < range; i++)
    {
        double middle   = (*data)[i];
        double previous = (*data)[i-1];
        double next     = (*data)[i+1];
        double new_value = (previous + middle + next) / 3.0;

        #pragma omp critical(assignment)
        (*newData)[i] = new_value;
    }
}

delete data;
data = newData;
I am aware that in the first and last iteration previous and next cannot be read from data; in the real code this is taken care of, but for this minimal example you get the idea of reading multiple times from data.
Upvotes: 2
Views: 1567
Reputation: 5829
First of all, get rid of all unnecessary dependencies. #pragma omp critical(assignment) is not necessary because each index of (*newData) is only written to once per loop, so there's no race condition.
Your code could now look like this:
#pragma omp parallel for
for (int i = 0; i < range; i++)
    (*newData)[i] = ((*data)[i-1] + (*data)[i] + (*data)[i+1]) / 3.0;
Now we're looking for bottlenecks. The list of potential candidates I came up with is this: slow division, cache thrashing, ILP, memory bandwidth limitations, and hidden dependencies. So let's analyze them further.
Slow division:
It takes some CPUs forever to calculate double/double. To know how long and what throughput your CPU has, you have to look at its specs. Maybe replacing /3.0 with *0.3333.. might help, but maybe your compiler does this already. Using extended instruction sets (like SSE/AVX) you might schedule several divisions/multiplications at once.
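One hedged way to ask for that kind of vectorization is OpenMP's own simd construct (available since OpenMP 4.0); whether it actually emits SSE/AVX instructions depends on your compiler and flags, so treat this as a sketch rather than a guaranteed win:

// Distribute iterations across threads and additionally ask the compiler
// to vectorize each thread's chunk (SSE/AVX where available).
#pragma omp parallel for simd
for (int i = 0; i < range; i++)
    (*newData)[i] = ((*data)[i-1] + (*data)[i] + (*data)[i+1]) / 3.0;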
Cache thrashing:
Because your CPU has to load/store one cache line at a time, there can be conflicts. Imagine thread 1 tries to write to (*newData)[1] and thread 2 to (*newData)[2] and they are on the same cache line. Now one of them has to wait for the other. You could resolve this with #pragma omp parallel for schedule(static, 64).
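Applied to the simplified loop from above, that would look like this (64 doubles are 512 bytes, so writes from two threads should rarely end up on the same cache line):

// Each thread gets contiguous chunks of 64 iterations, so writes from
// different threads land on different cache lines almost everywhere.
#pragma omp parallel for schedule(static, 64)
for (int i = 0; i < range; i++)
    (*newData)[i] = ((*data)[i-1] + (*data)[i] + (*data)[i+1]) / 3.0;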
ILP: CPUs can schedule multiple operations into a pipeline if the operations are independent. For this to happen you have to unroll your loop. This could look like this:
assert(range % 4 == 0);
#pragma omp parallel for
for (int i = 0; i < range/4; i++) {
    (*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) / 3.0;
    (*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) / 3.0;
    (*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) / 3.0;
    (*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) / 3.0;
}
Memory bandwidth limitations: For your very simple loop, think about this: how much memory do you have to load, and how long will your CPU be busy processing it? You're loading about 1 cache line and computing some dereferences, some pointer additions, two additions and one division. Which limit you hit depends on your CPU specs. Now consider cache locality. Can you modify your code to make better use of the cache? If one thread gets i=3 in one loop iteration and i=7 in the next, you have to reload 3 (*data)'s. But if you go from i=3 to i=4, you might not have to load anything, because (*data)[i+1] was in the cache line previously loaded. You save some RAM bandwidth. To make use of this, unroll the loop. Also, using float instead of double increases this chance.
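A rough sketch of the float idea, assuming the rest of your calculation can live with single precision; dataF and newDataF here are hypothetical float buffers standing in for your data and newData:

// With 4-byte floats a 64-byte cache line holds 16 values instead of 8,
// so every loaded line covers twice as many neighbourhoods.
// Boundaries are simply skipped here instead of handled separately.
#pragma omp parallel for
for (int i = 1; i < range - 1; i++)
    newDataF[i] = (dataF[i-1] + dataF[i] + dataF[i+1]) / 3.0f;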
Hidden dependencies:
Now this part I personally find very tricky. Sometimes your compiler isn't sure it can reuse some data, because it doesn't know it hasn't changed. Using const helps the compiler. But sometimes you need a restrict to give the compiler the right hint. I don't understand this well enough to explain it in depth, though.
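For what it's worth, here is a minimal sketch of how const and a restrict hint could look, assuming plain double buffers rather than your array type (__restrict__ is a GCC/Clang extension, MSVC spells it __restrict; it is not standard C++):

// Promise the compiler that in and out never alias, so loads of in[]
// do not have to be repeated after every store to out[].
void smooth(const double *__restrict__ in, double *__restrict__ out, int range)
{
    #pragma omp parallel for
    for (int i = 1; i < range - 1; i++)
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0;
}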
So here is what I would try:
const double ONETHIRD = 1.0 / 3.0;

assert(range % 4 == 0);
#pragma omp parallel for schedule(static, 1024)
for (int i = 0; i < range/4; i++) {
    (*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) * ONETHIRD;
    (*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) * ONETHIRD;
    (*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) * ONETHIRD;
    (*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) * ONETHIRD;
}
And then benchmark. Benchmark some more, and benchmark some more. Only benchmarks will show you which tricks help.
PS: One more thing to consider. If you see your program hitting the memory bandwidth hard, you could consider changing the algorithm. Maybe fuse two smoothing steps into one, going from
b[i] := (a[i-1] + a[i] + a[i+1]) / 3.0
followed by
d[i] := (b[i-1] + b[i] + b[i+1]) / 3.0
to the single pass
d[i] := (a[i-2] + 2.0 * a[i-1] + 3.0 * a[i] + 2.0 * a[i+1] + a[i+2]) / 9.0.
I think you will find out yourself why this helps.
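A sketch of what that fused pass could look like in your loop, again ignoring the boundary handling as in the other examples:

// Two averaging passes collapsed into one 5-point stencil, so the
// intermediate array is never written out to and read back from memory.
#pragma omp parallel for schedule(static, 1024)
for (int i = 2; i < range - 2; i++)
    (*newData)[i] = ((*data)[i-2] + 2.0 * (*data)[i-1] + 3.0 * (*data)[i]
                   + 2.0 * (*data)[i+1] + (*data)[i+2]) / 9.0;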
Have fun optimizing ;-)
Upvotes: 5
Reputation: 741
I assume you are trying to do some kind of convolution or median blur on a 1D array. The short answer is: stick to the default schedule strategy, and get rid of the critical section altogether.
As far as I can tell, you are quite new to parallelism, and it is a little confusing to deal with OpenMP directives like nowait/private/reduction/critical/atomic/single, etc. I think what you need is a well-written textbook to clarify the various concepts. With a sound foundation, an hour of learning OpenMP could be enough to deal with most daily programming.
Upvotes: 0