Reputation: 103
Below is Tim Mattson's code from his series of videos on OpenMP. The only change I made was to let the number of threads go up to 24, since I have a 24-core machine. It is not performing nearly as well as it should, and I am baffled as to why (see results below). Am I missing something here? I should mention that I am a theoretical computer scientist with experience in algorithms, but I am a bit rusty when it comes to hardware.
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main ()
{
    int i;
    double x, pi, sum = 0.0;
    double start_time, run_time;
    step = 1.0/(double) num_steps;

    for (i=1;i<=24;i++){
        sum = 0.0;
        omp_set_num_threads(i);
        start_time = omp_get_wtime();

        #pragma omp parallel
        {
            #pragma omp single
            printf(" num_threads = %d",omp_get_num_threads());

            #pragma omp for reduction(+:sum)
            for (i=1;i<= num_steps; i++){
                x = (i-0.5)*step;
                sum = sum + 4.0/(1.0+x*x);
            }
        }

        pi = step * sum;
        run_time = omp_get_wtime() - start_time;
        printf("\n pi is %f in %f seconds and %d threads\n",pi,run_time,i);
    }
}
I expected that with 24 cores it would be 20-24 times faster, but it is barely twice as fast. Why?! Here is the output:
num_threads = 1
pi is 3.141593 in 1.531695 seconds and 1 threads
num_threads = 2
pi is 3.141594 in 1.405237 seconds and 2 threads
num_threads = 3
pi is 3.141593 in 1.313049 seconds and 3 threads
num_threads = 4
pi is 3.141592 in 1.069563 seconds and 4 threads
num_threads = 5
pi is 3.141587 in 1.058272 seconds and 5 threads
num_threads = 6
pi is 3.141590 in 1.016013 seconds and 6 threads
num_threads = 7
pi is 3.141579 in 1.023723 seconds and 7 threads
num_threads = 8
pi is 3.141582 in 0.760994 seconds and 8 threads
num_threads = 9
pi is 3.141585 in 0.791577 seconds and 9 threads
num_threads = 10
pi is 3.141593 in 0.868043 seconds and 10 threads
num_threads = 11
pi is 3.141592 in 0.797610 seconds and 11 threads
num_threads = 12
pi is 3.141592 in 0.802422 seconds and 12 threads
num_threads = 13
pi is 3.141590 in 0.941856 seconds and 13 threads
num_threads = 14
pi is 3.141591 in 0.928252 seconds and 14 threads
num_threads = 15
pi is 3.141592 in 0.867834 seconds and 15 threads
num_threads = 16
pi is 3.141593 in 0.830614 seconds and 16 threads
num_threads = 17
pi is 3.141592 in 0.856769 seconds and 17 threads
num_threads = 18
pi is 3.141591 in 0.907325 seconds and 18 threads
num_threads = 19
pi is 3.141592 in 0.880962 seconds and 19 threads
num_threads = 20
pi is 3.141592 in 0.855475 seconds and 20 threads
num_threads = 21
pi is 3.141592 in 0.825202 seconds and 21 threads
num_threads = 22
pi is 3.141592 in 0.759689 seconds and 22 threads
num_threads = 23
pi is 3.141592 in 0.751121 seconds and 23 threads
num_threads = 24
pi is 3.141592 in 0.745476 seconds and 24 threads
So, what am I missing?
Upvotes: 0
Views: 79
Reputation: 106480
In general, with threading, there are two things to consider for speed-up: how much of the code can actually run in parallel, and the overhead of creating and coordinating the threads.
Amdahl's law gives us some context here. Let's be generous and assume that the portion of code which benefits from the speed-up (p) is 0.5, or half of the code, and let s = 24 be the speed-up of that portion (your 24 threads). Plugging those into Amdahl's law gives

S = 1 / ((1 - p) + p/s) = 1 / (0.5 + 0.5/24) ≈ 1.92

So in theory you're getting about 1.92 times better performance, which isn't the 24x improvement you were hoping for.
It would be worth analyzing which parts of the code lend themselves to a high degree of parallelization. Also, profile this without threads and see how the performance compares with your current thread layout.
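To get a feel for how strongly the bound depends on p, here is a small standalone C sketch (purely illustrative, not part of the pi program) that evaluates Amdahl's formula for a few assumed parallel fractions with s = 24:

#include <stdio.h>

/* Amdahl's law: overall speed-up when a fraction p of the work
   is sped up by a factor s and the rest stays serial. */
static double amdahl(double p, double s)
{
    return 1.0/((1.0 - p) + p/s);
}

int main ()
{
    double s = 24.0;                         /* 24 threads, as in the question */
    double p[] = {0.5, 0.9, 0.95, 0.99};     /* assumed parallel fractions */
    for (int k = 0; k < 4; k++)
        printf("p = %.2f -> speed-up %.2f\n", p[k], amdahl(p[k], s));
    return 0;
}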
Upvotes: 0
Reputation: 32727
You have one x variable that is shared among all threads. While the compiler will optimize its use so that you still get the correct result (by keeping the computed value for x in a register), that value is written out to memory every iteration. This will create stalls while cache lines are flushed and reloaded.

The fix is to declare x within the body of the loop where you use it (double x = (i-0.5)*step;), instead of at the top of main.
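For concreteness, here is a minimal sketch of the parallel region with that one change applied (everything else as in the question):

        #pragma omp parallel
        {
            #pragma omp single
            printf(" num_threads = %d",omp_get_num_threads());

            #pragma omp for reduction(+:sum)
            for (i=1;i<= num_steps; i++){
                double x = (i-0.5)*step;   /* x is now local to each iteration, so it can stay in a register */
                sum = sum + 4.0/(1.0+x*x);
            }
        }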
Upvotes: 3