Reputation: 5497
I want to statistically measure the performance of a program that I have parallelized with OpenMP. I opted to write a loop in a testing application that executes the parallel algorithm MAX_EXPERIMENTS
times and reports the time measurements to a file.
The solution is not as simple as hoisting the parallel pragma above the outer loop, because there are serial parts of the code in between the inner parallel loops.
The code:
#include <omp.h>
#include <vector>
#include <random>
#include <iostream>
#include <cmath>
#include <fstream>
#include <sstream>
#include <iomanip>
using namespace std;
int main()
{
    const int MAX_NUMBERS = 1e07;
    const int MAX_EXPERIMENTS = 1e02;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::bernoulli_distribution dis(0.1);

    vector<double> numbers;
    numbers.reserve(MAX_NUMBERS);
    for(unsigned int i = 0; i < MAX_NUMBERS; ++i)
    {
        if (dis(gen))
            numbers.emplace_back(100);
        else
            numbers.emplace_back(1);
    }

    stringstream ss;
    ss << "time-measurements-nthread-" << setfill('0') << setw(2)
       << omp_get_max_threads() << ".csv";
    ofstream exp(ss.str());
    exp << "time\n";

    for (unsigned int i = 0; i < MAX_EXPERIMENTS; ++i)
    {
        // BEGIN: Tested parallel program
        double t0 = omp_get_wtime();

        // Some serial work.
        double x = 0;
        //#pragma omp parallel for schedule(dynamic) reduction(+:x) // exp-01
        #pragma omp parallel for schedule(static) reduction(+:x) // exp-02
        for(unsigned int i = 0; i < numbers.size(); ++i)
        {
            if (numbers[i] > 1)
                x = x + cos(numbers[i]); // Some work being done.
        }

        double t1 = omp_get_wtime();
        // Some serial work

        // Measure program execution
        exp << t1 - t0 << "\n";
        // END: Tested parallel program
    }
}
The program first serially initializes 1e07 numbers to either 1 or 100, such that the probability of hitting 100 is 10%, which matches my real-world input data.
The main testing loop performs 100 experiments, and the loop body models the tested parallel algorithm. There are parts of the parallel algorithm that must be executed serially. Writing #pragma omp parallel for inside a loop is supposedly a bad idea, as it opens and closes the threads each time an experiment is run.
Question 1: Even though one would usually avoid opening parallel regions inside loops, is it justified in this case, where each loop iteration represents an independent execution of the parallel program and preparing the input data is much faster than the experiments themselves?
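For reference, here is a rough sketch (not something I have benchmarked, written only to show the restructuring mentioned above) of what hoisting the parallel region above the experiment loop would look like; every serial part has to be wrapped in a single block, which is exactly what I would like to avoid:
double x = 0;
double t0 = 0, t1 = 0;
#pragma omp parallel
{
    // e: experiment index (renamed from i just for the sketch)
    for (unsigned int e = 0; e < MAX_EXPERIMENTS; ++e)
    {
        #pragma omp single
        {
            t0 = omp_get_wtime();
            x = 0;                   // serial work before the parallel loop
        }                            // implicit barrier of single
        #pragma omp for schedule(static) reduction(+:x)
        for (unsigned int i = 0; i < numbers.size(); ++i)
        {
            if (numbers[i] > 1)
                x = x + cos(numbers[i]);
        }                            // implicit barrier of for; reduction result visible afterwards
        #pragma omp single
        {
            t1 = omp_get_wtime();    // serial work after the parallel loop
            exp << t1 - t0 << "\n";
        }
    }
}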
To visualize the written data, I've used Python (Jupyter notebook code):
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import os
rcParams["figure.figsize"] = [10,20]
rcParams["font.size"] = 24
def plot_experiment(expattern):
    thread1df = pd.read_csv("time-measurements-nthread-01-%s.csv" % expattern)
    thread2df = pd.read_csv("time-measurements-nthread-02-%s.csv" % expattern)
    thread4df = pd.read_csv("time-measurements-nthread-04-%s.csv" % expattern)
    thread8df = pd.read_csv("time-measurements-nthread-08-%s.csv" % expattern)

    f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    ax1.plot(thread1df["time"], label="time 1", color='g')
    ax1.plot(thread2df["time"], label="time 2", color='r')
    ax1.plot(thread4df["time"], label="time 4", color='b')
    ax1.plot(thread8df["time"], label="time 8", color='k')
    ax2.plot(thread1df["time"] / thread8df["time"], label="speedup 8", color='k')
    ax2.plot(thread1df["time"] / thread4df["time"], label="speedup 4", color='b')
    ax2.plot(thread1df["time"] / thread2df["time"], label="speedup 2", color='r')
    ax1.set_ylabel("Time in seconds")
    ax1.legend()
    ax2.set_xlabel("Test number")
    ax2.legend()
    ax2.set_ylabel("Speedup")
plot_experiment("exp-01")
And the application is compiled with gcc, using optimization: g++ -std=c++1y -fopenmp -O3 main.cpp -o main
The experiment is executed with for i in 1 2 4 8; do export OMP_NUM_THREADS=$i && ./main && sleep 5; done;
And the experiment files are then re-named for pandas using for file in *nthread-0[0-9].csv*; do mv $file ${file/.csv/-exp-02.csv}; done
(replace exp-02
with exp-01
for the first experiment).
In the first experiment, I've tried dynamic scheduling and got the following diagrams:
That is odd, because it seems that adding threads slows the program down. Checking the bottlenecks with HPCToolkit for exp-01 with 8 threads, I noticed that OpenMP spends a ton of time on switching in the dynamic scheduling mode:
So I switched the scheduling mode to static, re-ran the experiments, and got the following:
Now there is some scaling, at least for 2 threads, but the 4-thread timings oscillate too much and there is not much benefit in using 8 threads. I checked it with HPCToolkit again and got this:
I think this is telling me that starting and stopping threads is eating up 85% of my runtime with 8 threads; however, the HPCToolkit manual states:
Furthermore, if the filtered nodes are children of a “fake” procedures (such as program_root and thread_root), the exclusive metrics in callers view and flat view can be misleading.
Question 2: Does experiment 02 suffer significant overhead from opening and closing the parallel region within the experiment loop? If so, how can I work around this, taking into account the serial parts of the algorithm?
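To put a number on the fork/join cost alone, independent of my algorithm, a minimal micro-benchmark could time nothing but an empty parallel region (just a sketch; its results are not part of the measurements above):
#include <omp.h>
#include <cstdio>

int main()
{
    const int reps = 100;
    double t0 = omp_get_wtime();
    for (int i = 0; i < reps; ++i)
    {
        #pragma omp parallel
        {
            // empty body: only the cost of opening/closing the region remains
        }
    }
    double t1 = omp_get_wtime();
    std::printf("average fork/join overhead: %g s\n", (t1 - t0) / reps);
}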
Software: Arch Linux, g++ (GCC) 7.1.1 20170630, hpcrun: A member of HPCToolkit, version 2017.11, CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
Edit
I have tried changing the thread persistence behavior using the environment variables proposed in the answer to this question:
export OMP_WAIT_POLICY=active GOMP_SPINCOUNT=infinite
here are the results:
Obviously, the oscillations caused by thread creation/destruction are much smaller, but are they gone? Is there a way to change the program so that I don't have to rely on spinning threads? Investigating the bottlenecks of this program will still show a substantial number of CPU cycles being spent by spinning threads.
Upvotes: 4
Views: 1018
Reputation: 22670
From the discussion in the comments it seems that your main problem is that you have a complex existing application and want to place a worksharing loop in some inner part. But creating all the threads only there introduces too much overhead in your application, and the thread pooling done by libgomp seems insufficient.
If you want to do this without restructuring, it might help to use taskloop, which works similarly to for but can be nested within a single region. In turn, it may not be as efficient as for. Essentially, your code would look like this:
int a;
#pragma omp parallel
{
    int b;
    #pragma omp single
    {
        int c;
        // lots of serial code
        // somewhere in between
        #pragma omp taskloop
        for (...) {
            int d;
        }
        // lots of serial code
    }
}
Note that data-sharing for task-generating constructs works a bit differently. By default, a variable declared outside of the parallel region (a) is shared within the parallel region and also shared among the tasks executing the inner loop. A variable declared inside the parallel region but outside the taskloop (b, c) is private within the parallel region and firstprivate for the tasks - that is, each task gets its own copy that is initialized with the outer value. Finally, d is just local to each loop iteration.
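To make those defaults explicit, here is a minimal, self-contained sketch (variable names and values chosen purely for illustration); the clauses on the taskloop merely restate the defaults described above:
#include <cstdio>

int main()
{
    int a = 0;                       // declared outside: shared everywhere by default
    #pragma omp parallel
    {
        int b = 1;                   // private to each thread of the team
        #pragma omp single
        {
            int c = 2;               // local to the thread executing the single
            #pragma omp taskloop shared(a) firstprivate(b, c)
            for (int i = 0; i < 4; ++i)
            {
                int d = i;           // local to each loop iteration / task
                std::printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);
            }
        }
    }
}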
Edit: Do not put any barriers in. Due to the implicit taskgroup at the end of taskloop, the serial portion and the tasks are isolated from each other: the serial code after the loop does not run until all tasks have completed.
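For instance, in a sketch like the following (again purely illustrative), the line after the taskloop can safely read the results without any explicit synchronization:
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000;
    std::vector<double> out(n);      // shared; each task writes disjoint elements
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop
        for (int i = 0; i < n; ++i)
            out[i] = 2.0 * i;
        // No barrier or taskwait needed: the implicit taskgroup of taskloop
        // guarantees that all tasks above have completed before this line runs.
        std::printf("out[n-1] = %f\n", out[n - 1]);
    }
}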
Upvotes: 2