Reputation: 135
I've been trying to parallelize a nested loop as shown here:
I'm comparing the execution times of a sequential version and a parallelized version of this code, but the sequential version always runs faster, across a variety of inputs.
The inputs to the program are:
I've looked around the web and tried a few things (e.g. nowait), but nothing really changed. I'm fairly sure the parallel code is correct because I checked the outputs. Is there something I'm doing wrong here?
EDIT: Also, it seems that you can't use the reduction clause on C structures?
EDIT2: I'm using gcc on Linux with a 2-core CPU. I've tried values as high as numParticles = 40 and numTimeSteps = 100000. Should I try higher?
Thanks
Upvotes: 1
Views: 2613
Reputation: 12764
I can think of two possible sources of slowdown: a) the compiler made some optimizations (vectorization being the first suspect) in the sequential version but not in the OpenMP version, and b) thread management overhead. Both are easy to check if you also run the OpenMP version with a single thread (i.e. set numThreads to 1). If it is much slower than the sequential version, then (a) is the most likely reason; if it is similar to the sequential version and faster than the same code with 2 threads, the most likely reason is (b).
In the latter case, you can restructure the OpenMP code for less overhead. First, having two parallel regions (#pragma omp parallel) inside the loop is unnecessary; you can have a single parallel region with two worksharing loops inside it:
for (t = 0; t <= numTimeSteps; t++) {
    #pragma omp parallel num_threads(numThreads)
    {
        #pragma omp for private(j)
        /* The first loop goes here */

        #pragma omp for
        /* The second loop goes here */
    }
}
Then, the parallel region can be started before the timestep loop:
#pragma omp parallel num_threads(numThreads) private(t)
for (t = 0; t <= numTimeSteps; t++) {
    ...
}
Each thread in the region then runs the timestep loop, and at each iteration the threads synchronize at the implicit barriers at the end of the OpenMP worksharing loops. This also ensures that the same set of threads runs through the whole computation, no matter which OpenMP implementation is used.
Upvotes: 1
Reputation: 8401
It is possible that your loops are too small. There is overhead associated with creating threads to process portions of a loop, so if the loop is too small, a parallelized version may run slower. Another consideration is the number of cores available.
Your second omp directive is less likely to be useful because that loop does far fewer calculations. I would suggest removing it.
EDIT: I tested your code with numParticles = 1000 and two threads. It ran in 30 seconds; the single-threaded version ran in 57 seconds. Even with numParticles = 40 I see a significant speedup. This is with Visual Studio 2010.
Upvotes: 1