Reputation: 959
What is the easiest way to profile a C++ program parallelized with OpenMP, on a machine on which one has no sudo rights?
Upvotes: 1
Views: 2344
Reputation: 3129
I would recommend using the Intel VTune Amplifier XE profiler.
The Basic Hotspots analysis doesn't require root privileges, and you can even install the tool without being in sudoers.
For OpenMP analysis, it's best to compile with the Intel OpenMP implementation and set the environment variable KMP_FORKJOIN_FRAMES to 1 before running the profiling session. This lets the tool visualize the time region from fork point to join point for each parallel region, which gives a good idea of where you had sufficient parallelism and where you did not. By using a grid grouping such as Frame Domain / Frame Type / Function, you can also correlate the parallel regions with what was happening on the CPUs, which helps find functions that didn't scale.
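To make the fork-to-join idea concrete, here is a rough hand-instrumented sketch. It is not a description of how VTune works internally. It times one parallel region from just before the fork to just after the join and compares that wall time with the time each thread spent in its work. The names RegionStats and time_region are made up for this illustration, and the serial stand-ins exist only so that it still builds when OpenMP is not enabled.

```cpp
#include <cassert>
#include <chrono>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
static int omp_get_thread_num()  { return 0; }
static int omp_get_num_threads() { return 1; }
static int omp_get_max_threads() { return 1; }
#endif

// Hypothetical helper type: timings for one parallel region.
struct RegionStats {
    double wall_sec;      // fork-to-join wall-clock time
    double busy_sum_sec;  // per-thread working time, summed over threads
    int    threads;       // threads that actually ran the region
};

// Runs work(thread_id) on every thread of one parallel region and times it.
template <class Work>
RegionStats time_region(Work work) {
    using clock = std::chrono::steady_clock;
    using seconds = std::chrono::duration<double>;

    std::vector<double> busy(omp_get_max_threads(), 0.0);
    int threads = 1;

    const auto fork_time = clock::now();
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        assert(tid >= 0 && tid < static_cast<int>(busy.size()));
        const auto t0 = clock::now();
        work(tid);
        busy[tid] = seconds(clock::now() - t0).count();
        #pragma omp single
        threads = omp_get_num_threads();
    }
    const auto join_time = clock::now();

    RegionStats s;
    s.wall_sec = seconds(join_time - fork_time).count();
    s.busy_sum_sec = 0.0;
    for (double b : busy) s.busy_sum_sec += b;
    s.threads = threads;
    return s;
}
```

A ratio busy_sum_sec / (threads * wall_sec) close to 1 suggests a balanced region, while a noticeably lower ratio means some threads sat idle before the join. Time spent waiting at barriers inside the work itself still counts as busy here, which is one reason a profiler gives a better picture than this kind of manual timing.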
For example, imagine a simple program like the one below. It runs some balanced work, then some serial work, and then some imbalanced work, calling the delay() function in each case and making sure delay() is not inlined. This imitates a real workload where all kinds of unfamiliar functions may be invoked from parallel regions, which makes it harder to judge whether the parallelism was good or bad by looking at a hot-functions profile alone:
#include <stdio.h>
#include <omp.h>

/* Helpers not shown in this answer; the signatures are assumed. */
void delay(int amount);
void calibrate(void);

/* Every thread does the same amount of work. */
void __attribute__ ((noinline)) balanced_work() {
    printf("Starting ideal parallel\n");
    #pragma omp parallel
    delay(3000000);
}

/* Only the calling thread works. */
void __attribute__ ((noinline)) serial_work() {
    printf("Starting serial work\n");
    delay(3000000);
}

/* The amount of work between barriers depends on the thread number. */
void __attribute__ ((noinline)) imbalanced_work() {
    printf("Starting parallel with imbalance\n");
    #pragma omp parallel
    {
        int mythread = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        delay(1000000);
        printf("First barrier %d\n", mythread);
        #pragma omp barrier
        delay(mythread * 25000 + 200000);
        printf("Second barrier %d\n", mythread);
        #pragma omp barrier
        delay((nthreads - 1 - mythread) * 25000 + 200000);
        printf("Join barrier %d\n", mythread);
    }
}

int
main(int argc, char **argv)
{
    setvbuf(stdout, NULL, _IONBF, 0);  /* unbuffered output */
    calibrate();
    balanced_work();
    serial_work();
    imbalanced_work();
    printf("Bye bye\n");
    return 0;
}
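The listing above calls delay() and calibrate() without showing them. A minimal sketch of what such helpers could look like follows, assuming a calibrated busy-wait loop and assuming the argument of delay() is a duration in microseconds. Neither assumption comes from the original code, and spin and g_iters_per_usec are made-up names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-ins; the original answer does not show these helpers.
static double g_iters_per_usec = 0.0;  // filled in by calibrate()

// Burn CPU for a fixed number of loop iterations.
static void spin(std::uint64_t iters) {
    volatile std::uint64_t sink = 0;  // volatile keeps the loop from being optimized out
    for (std::uint64_t i = 0; i < iters; ++i) sink = sink + i;
}

// Measure how many spin iterations fit into one microsecond on this machine.
void calibrate() {
    using clock = std::chrono::steady_clock;
    const std::uint64_t probe = 10000000;  // arbitrary probe size
    const auto t0 = clock::now();
    spin(probe);
    const double usec =
        std::chrono::duration<double, std::micro>(clock::now() - t0).count();
    g_iters_per_usec = (usec > 0.0) ? probe / usec : 1.0;
}

// Busy-wait for roughly `usec` microseconds (the unit is an assumption).
void __attribute__((noinline)) delay(int usec) {
    assert(g_iters_per_usec > 0.0 && "call calibrate() first");
    if (usec <= 0) return;
    spin(static_cast<std::uint64_t>(usec * g_iters_per_usec));
}
```

A busy-wait is used rather than a sleep so that the threads actually consume CPU time and show up as hotspots in the profile. The calibration is only approximate, since clock frequency changes and other load on the machine will skew it.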
For this code, a typical function profile will show most of the time spent in the delay() function. Viewing the data in VTune with frame grouping and CPU usage information, on the other hand, gives an idea of what is serial, what is imbalanced and what is balanced. Here is what you might see with VTune:
Here one can see that:
Hope this helps.
Upvotes: 8