lodhb

Reputation: 959

Profiling OpenMP-parallelized C++ code

What is the easiest way to profile a C++ program parallelized with OpenMP, on a machine on which one has no sudo rights?

Upvotes: 1

Views: 2344

Answers (1)

Alexey Alexandrov

Reputation: 3129

I would recommend the Intel VTune Amplifier XE profiler.

The Basic Hotspots analysis doesn't require root privileges, and you can even install the tool without being in sudoers.

For OpenMP analysis it's best to build with the Intel OpenMP implementation and set the environment variable KMP_FORKJOIN_FRAMES to 1 before running the profiling session. This lets the tool visualize the time from the fork point to the join point of each parallel region, which gives a good idea of where you had sufficient parallelism and where you did not. With a grid grouping such as Frame Domain / Frame Type / Function you can also correlate the parallel regions with what was happening on the CPUs, which helps find functions that didn't scale.

For example, imagine simple code like the sample below. It runs some balanced work, then some serial work, and then some imbalanced work, calling a delay() function for each and making sure delay() isn't inlined. This imitates a real workload where all kinds of unfamiliar functions may be invoked from parallel regions, which makes it hard to judge whether the parallelism was good or bad from a hot-functions profile alone:

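#include <stdio.h>
#include <omp.h>

// delay() and calibrate() are not shown in this answer;
// the prototypes below are assumed so the sample compiles.
void delay(long);
void calibrate();

void __attribute__ ((noinline)) balanced_work() {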
    printf("Starting ideal parallel\n");
#pragma omp parallel
    delay(3000000);
}
void __attribute__ ((noinline)) serial_work() {
    printf("Starting serial work\n");
    delay(3000000);
}
void __attribute__ ((noinline)) imbalanced_work() {
    printf("Starting parallel with imbalance\n");
#pragma omp parallel
    {
        int mythread = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        delay(1000000);
        printf("First barrier %d\n", mythread);
        #pragma omp barrier
        delay(mythread * 25000 + 200000);
        printf("Second barrier %d\n", mythread);
        #pragma omp barrier
        delay((nthreads - 1 - mythread) * 25000 + 200000);
        printf("Join barrier %d\n", mythread);
    }
}

int
main(int argc, char **argv)
{
    setvbuf(stdout, NULL, _IONBF, 0);

    calibrate();
    balanced_work();
    serial_work();
    imbalanced_work();

    printf("Bye bye\n");
}
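
The answer does not show delay() or calibrate(). A minimal sketch of stand-ins might look like the following. It assumes the argument is a duration in microseconds (the original unit is not stated) and that calibrate() has nothing to set up, so both are guesses rather than the author's code. It busy-waits instead of sleeping so that the time is charged to the CPU and shows up in a hotspots profile.

```cpp
#include <chrono>

// Hypothetical stand-in: the original calibrate() is not shown.
// With a clock-based delay there is nothing to calibrate.
void calibrate() {}

// Hypothetical stand-in: spin for roughly `usec` microseconds.
// The unit is an assumption; the answer does not state it.
// noinline keeps delay() visible as its own function in the profile.
void __attribute__((noinline)) delay(long usec) {
    const auto deadline = std::chrono::steady_clock::now()
                        + std::chrono::microseconds(usec);
    while (std::chrono::steady_clock::now() < deadline) {
        // busy-wait on purpose: the thread stays on the CPU
    }
}
```

With these two definitions linked in, the sample should build with any compiler that supports OpenMP and the GCC-style noinline attribute.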

For this code a typical function profile will show most of the time spent in the delay() function. On the other hand, viewing the data with frame grouping and CPU usage information in VTune will give an idea about what is serial, what is imbalanced and what is balanced. Here is what you might see with VTune:

[Screenshot: OpenMP frames for the sample]

Here one can see that:

  • There were 13.671 seconds of elapsed time spent executing an imbalanced region. The imbalance is visible in the CPU Usage breakdown.
  • There were 3.652 seconds of elapsed time that were pretty well balanced. There is some red time here, likely from system effects, which would be worth investigating in a real-world case.
  • There were also about 4 seconds of serial time. Working that out is currently a bit tricky: you have to take the elapsed time from the summary (21.276 seconds in my case) and subtract 13.671 and 3.652 from it, which leaves about 3.95 seconds. But easy enough.

Hope this helps.

Upvotes: 8
