Neo

Reputation: 1554

Parallel speedup with OpenMP

I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).

Scenario 1:

Sequential time measurement:

startTime = omp_get_wtime();
for (blah blah) {
    computation;
}
endTime = omp_get_wtime();
seq_time = endTime - startTime;

Parallel time measurement:

startTime = omp_get_wtime();
#pragma omp parallel for reduction(+:pi) private(i)
for (blah blah) {
    computation;
}
endTime = omp_get_wtime();
paralleltime = endTime - startTime;

speedup = seq_time/paralleltime;

Scenario 2:

Sequential time measurement:

for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    seq_time += endTime - startTime;
}

Parallel time measurement:

#pragma omp parallel for reduction(+:pi, paralleltime) private(i, startTime, endTime)
for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;
}

speedup = seq_time/paralleltime;

I know that Scenario 2 is NOT good production code, but I think it measures the theoretical performance by OVERLOOKING the overhead OpenMP incurs in spawning and managing several threads (thread context switching), so it should give a linear speedup. Scenario 1, on the other hand, includes the overhead of spawning and managing threads.

My doubt is this: with Scenario 1, I get a speedup that starts out linear but tapers off as the number of iterations grows. With Scenario 2, I get a fully linear speedup irrespective of the number of iterations. I was told that in reality Scenario 1 should give a linear speedup irrespective of the number of iterations, but I think it will not, because of the high overhead of thread management. Can someone please explain why I am wrong?

Thanks! And sorry about the rather long post.

Upvotes: 5

Views: 6410

Answers (4)

minjang

Reputation: 9050

First of all, by the definition of speedup, you should use Scenario 1, which includes the parallel overhead.


In Scenario 2, the code measuring paralleltime is wrong. To achieve your goal in Scenario 2, you need a per-thread paralleltime: allocate double paralleltime[NUM_THREADS] and index it by omp_get_thread_num(). (Note that this layout will have false sharing, so you'd better allocate a 64-byte padded struct per thread.) Then measure the per-thread computation time, and finally take the longest one to calculate a different kind of speedup (I'd call it a sort of parallelism instead).


No, you may see sub-linear speedup even in Scenario 2, and super-linear speedup can also occur. The potential reasons (i.e., excluding parallel overhead) are:

  1. Load imbalance: the workload of the computation differs from iteration to iteration. This is the most common reason for low speedup (but you're saying load imbalance is not the case here).
  2. Synchronization cost: if there is any kind of synchronization (e.g., mutex, event, barrier), you may have waiting/blocking time.
  3. Caches and memory cost: when the computation requires large bandwidth and a large working set, parallel code may suffer from bandwidth limitation (though it's rare in reality) and cache conflicts. False sharing can also be a significant reason, but it's easy to avoid. A super-linear effect can also be observed, because using multiple cores gives you more cache in total (i.e., more private L1/L2 caches).

Scenario 1, in addition, includes the overhead of the parallel runtime:

  1. Forking/joining threads: although most parallel library implementations won't do physical thread creation/termination on every parallel construct.
  2. Dispatching/joining logical tasks: even if physical threads are already created, the runtime needs to dispatch logical tasks to threads (in general, M tasks to N threads) and perform a sort of joining operation at the end (e.g., an implicit barrier).
  3. Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's default static scheduling), the overhead is minimal. You can safely ignore it when the workload is sufficiently large (say, 0.1 second). Dynamic scheduling (such as work-stealing in TBB) has some overhead, but it is also insignificant once your workload is sufficient.

I don't think your code (a one-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless it is called a million times per second. So it may be one of the other reasons I've mentioned above.

Keep in mind that many factors determine speedup, from the inherent parallelism (load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).

Upvotes: 5

phkahler

Reputation: 5767

There are several options in OpenMP for how the work is distributed among the threads, and this can affect the linearity of your measurement method 1. Your measurement method 2 doesn't seem useful. What were you trying to get at with it? If you want to know single-thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.

Upvotes: 0

Schnouki

Reputation: 7707

What do you want to measure exactly? The overhead due to parallelism is part of the real execution time, hence IMHO scenario 1 is better.

Besides, according to your OpenMP directives, you're doing a reduction into some variable. In scenario 1, you're taking this into account; in scenario 2, you're not. So basically, you're measuring fewer things than in scenario 1, which probably has some impact on your measurements too.

Otherwise, Jonathan Dursi's answer is excellent.

Upvotes: 2

Jonathan Dursi

Reputation: 50927

There are many situations where scenario 2 won't give you linear speedup either: false sharing between threads (or, for that matter, true sharing of shared variables that get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.

More generally, once you get to the point where you're putting timers inside for loops, you're considering more fine-grained timing information than is really appropriate to measure with timers like this. You might well want to disentangle the thread-management overhead from the actual work for a variety of reasons, but here you're trying to do that by inserting N extra calls to omp_get_wtime(), plus the arithmetic and the reduction operation, all of which have non-negligible overhead of their own.

If you really want accurate timing of how much time is spent in the computation line, you really want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof, Scalasca, or Open|SpeedShop (all free software), or a commercial package like Intel's VTune, will give you information about how much time is spent on that line, often even per thread, with much lower overhead.

Upvotes: 5
