Reputation: 5497
I want to statistically measure the performance of a program that I have parallelized with OpenMP. I opted to write a loop in a testing application that executes the parallel algorithm MAX_EXPERIMENTS
times and reports the time measurements to a file.
The solution is not as simple as hoisting the parallel pragma above the outer loop, because there are serial parts of the code in between the inner parallel loops.
The code:
#include <omp.h>
#include <vector>
#include <random>
#include <iostream>
#include <cmath>
#include <fstream>
#include <sstream>
#include <iomanip>
using namespace std;
int main()
{
    const int MAX_NUMBERS = 1e07;
    const int MAX_EXPERIMENTS = 1e02;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::bernoulli_distribution dis(0.1);

    vector<double> numbers;
    numbers.reserve(MAX_NUMBERS);
    for(unsigned int i = 0; i < MAX_NUMBERS; ++i)
    {
        if (dis(gen))
            numbers.emplace_back(100);
        else
            numbers.emplace_back(1);
    }

    stringstream ss;
    ss << "time-measurements-nthread-" << setfill('0') << setw(2)
       << omp_get_max_threads() << ".csv";
    ofstream exp(ss.str());
    exp << "time\n";

    for (unsigned int i = 0; i < MAX_EXPERIMENTS; ++i)
    {
        // BEGIN: Tested parallel program
        double t0 = omp_get_wtime();

        // Some serial work.
        double x = 0;
        //#pragma omp parallel for schedule(dynamic) reduction(+:x) // exp-01
        #pragma omp parallel for schedule(static) reduction(+:x) // exp-02
        for(unsigned int i = 0; i < numbers.size(); ++i)
        {
            if (numbers[i] > 1)
                x = x + cos(numbers[i]); // Some work being done.
        }

        double t1 = omp_get_wtime();
        // Some serial work

        // Measure program execution
        exp << t1 - t0 << "\n";
        // END: Tested parallel program
    }
}
The program first serially initializes 1e07 numbers to either 1 or 100, such that the probability of hitting 100 is 10%, which matches my real-world input data.
The main testing loop performs 100 experiments, and the loop body models the tested parallel algorithm. There are parts of the parallel algorithm that must be executed serially. Writing #pragma omp parallel for inside a loop is supposedly a bad idea, as it opens and closes the threads each time an experiment is run.
Question 1: Even though one would usually avoid opening parallel regions inside loops, is it justified in this case, where each loop iteration represents an independent execution of the parallel program and preparing the input data is much faster than the experiments themselves?
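For reference, here is a rough sketch (not something I have benchmarked, written only to show the restructuring mentioned above) of what hoisting the parallel region above the experiment loop would look like; every serial part has to be wrapped in a single block, which is exactly what I would like to avoid:
double x = 0;
double t0 = 0, t1 = 0;
#pragma omp parallel
{
    // e: experiment index (renamed from i just for the sketch)
    for (unsigned int e = 0; e < MAX_EXPERIMENTS; ++e)
    {
        #pragma omp single
        {
            t0 = omp_get_wtime();
            x = 0;                   // serial work before the parallel loop
        }                            // implicit barrier of single
        #pragma omp for schedule(static) reduction(+:x)
        for (unsigned int i = 0; i < numbers.size(); ++i)
        {
            if (numbers[i] > 1)
                x = x + cos(numbers[i]);
        }                            // implicit barrier of for; reduction result visible afterwards
        #pragma omp single
        {
            t1 = omp_get_wtime();    // serial work after the parallel loop
            exp << t1 - t0 << "\n";
        }
    }
}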
To visualize the written data, I've used Python (Jupyter notebook code):
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import os
rcParams["figure.figsize"] = [10,20]
rcParams["font.size"] = 24
def plot_experiment(expattern):
    thread1df = pd.read_csv("time-measurements-nthread-01-%s.csv" % expattern)
    thread2df = pd.read_csv("time-measurements-nthread-02-%s.csv" % expattern)
    thread4df = pd.read_csv("time-measurements-nthread-04-%s.csv" % expattern)
    thread8df = pd.read_csv("time-measurements-nthread-08-%s.csv" % expattern)

    f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    ax1.plot(thread1df["time"], label="time 1", color='g')
    ax1.plot(thread2df["time"], label="time 2", color='r')
    ax1.plot(thread4df["time"], label="time 4", color='b')
    ax1.plot(thread8df["time"], label="time 8", color='k')
    ax2.plot(thread1df["time"] / thread8df["time"], label="speedup 8", color='k')
    ax2.plot(thread1df["time"] / thread4df["time"], label="speedup 4", color='b')
    ax2.plot(thread1df["time"] / thread2df["time"], label="speedup 2", color='r')
    ax1.set_ylabel("Time in seconds")
    ax1.legend()
    ax2.set_xlabel("Test number")
    ax2.legend()
    ax2.set_ylabel("Speedup")
plot_experiment("exp-01")
And the application is compiled with gcc, using optimization: g++ -std=c++1y -fopenmp -O3 main.cpp -o main
The experiment is executed with for i in 1 2 4 8; do export OMP_NUM_THREADS=$i && ./main && sleep 5; done;
And the experiment files are then re-named for pandas using for file in *nthread-0[0-9].csv*; do mv $file ${file/.csv/-exp-02.csv}; done
(replace exp-02
with exp-01
for the first experiment).
In the first experiment, I've tried dynamic scheduling and got the following diagrams:
That is odd, because it seems that adding threads slows the program down. Checking the bottlenecks with HPCToolkit for exp-01 with 8 threads, I noticed that OpenMP spends a ton of time on switching in the dynamic scheduling mode:
So I switched the scheduling mode to static, re-ran the experiments, and got the following:
Now there is some scaling, at least for 2 threads, but the 4-thread timings oscillate too much and there is not much benefit in using 8 threads. I checked it with HPCToolkit again and got this:
I think this is telling me that starting and stopping threads is eating up 85% of my runtime with 8 threads; however, the HPCToolkit manual states:
Furthermore, if the filtered nodes are children of a “fake” procedures (such as program_root and thread_root), the exclusive metrics in callers view and flat view can be misleading.
Question 2: Does experiment 02 suffer significant overhead from opening and closing the parallel region within the experiment loop? If so, how can I work around this, taking into account the serial parts of the algorithm?
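To put a number on the fork/join cost alone, independent of my algorithm, a minimal micro-benchmark could time nothing but an empty parallel region (just a sketch; its results are not part of the measurements above):
#include <omp.h>
#include <cstdio>

int main()
{
    const int reps = 100;
    double t0 = omp_get_wtime();
    for (int i = 0; i < reps; ++i)
    {
        #pragma omp parallel
        {
            // empty body: only the cost of opening/closing the region remains
        }
    }
    double t1 = omp_get_wtime();
    std::printf("average fork/join overhead: %g s\n", (t1 - t0) / reps);
}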
Software: Arch Linux, g++ (GCC) 7.1.1 20170630, hpcrun: A member of HPCToolkit, version 2017.11, CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
Edit
I have tried changing the thread persistence behavior using the environment variables proposed in the answer to this question:
export OMP_WAIT_POLICY=active GOMP_SPINCOUNT=infinite
here are the results:
Obviously, the oscillations caused by thread creation/destruction are much smaller, but are they gone? Is there a way to change the program so that I don't have to rely on spinning threads? Investigating the bottlenecks of this program will still show a substantial number of CPU cycles being spent by spinning threads.
Upvotes: 4
Views: 1018
Reputation: 22670
From the discussion in the comments it seems that your main problem is that you have a complex existing application and want to place a worksharing loop in some inner part. But creating all the threads only there introduces too much overhead in your application, and the thread pooling done by libgomp seems insufficient.
If you want to do this without restructuring, it might help to use taskloop, which works similarly to for but can be nested within a single region. In turn, it may not be as efficient as for. Essentially, your code would look like this:
int a;
#pragma omp parallel
{
    int b;
    #pragma omp single
    {
        int c;
        // lots of serial code
        // somewhere in between
        #pragma omp taskloop
        for (...) {
            int d;
        }
        // lots of serial code
    }
}
Note that data-sharing for task-generating constructs works a bit differently. By default, a variable declared outside of the parallel region (a) is shared within the parallel region and also shared among the tasks executing the inner loop. A variable declared inside the parallel region but outside the taskloop (b, c) is private within the parallel region and firstprivate for the tasks - that is, each task gets its own copy that is initialized with the outer value. Finally, d is just local to each loop iteration.
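To make those defaults explicit, here is a minimal, self-contained sketch (variable names and values chosen purely for illustration); the clauses on the taskloop merely restate the defaults described above:
#include <cstdio>

int main()
{
    int a = 0;                       // declared outside: shared everywhere by default
    #pragma omp parallel
    {
        int b = 1;                   // private to each thread of the team
        #pragma omp single
        {
            int c = 2;               // local to the thread executing the single
            #pragma omp taskloop shared(a) firstprivate(b, c)
            for (int i = 0; i < 4; ++i)
            {
                int d = i;           // local to each loop iteration / task
                std::printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);
            }
        }
    }
}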
Edit: Do not put any barriers in. Due to the implicit taskgroup at the end of taskloop, the serial portion and the tasks are isolated from each other: the serial code after the loop does not run until all tasks have completed.
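For instance, in a sketch like the following (again purely illustrative), the line after the taskloop can safely read the results without any explicit synchronization:
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000;
    std::vector<double> out(n);      // shared; each task writes disjoint elements
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop
        for (int i = 0; i < n; ++i)
            out[i] = 2.0 * i;
        // No barrier or taskwait needed: the implicit taskgroup of taskloop
        // guarantees that all tasks above have completed before this line runs.
        std::printf("out[n-1] = %f\n", out[n - 1]);
    }
}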
Upvotes: 2