Reputation: 53
My OpenMP implementation performs very poorly. When I profile it with VTune, it shows very low CPU usage, and I don't know why. Does anyone have an idea?
Hardware:
Implementation:
struct Lineitem {
    int64_t l_quantity;
    int64_t l_extendedprice;
    float l_discount;
    unsigned int l_shipdate;
};
Lineitem* array = (Lineitem*)malloc(sizeof(Lineitem) * array_length);
// array will be filled
#pragma omp parallel for num_threads(48) shared(array, array_length, date1, date2) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i].l_shipdate >= date1 && array[i].l_shipdate < date2 &&
        array[i].l_discount >= 0.08f && array[i].l_discount <= 0.1f &&
        array[i].l_quantity < 24)
    {
        sum += (array[i].l_extendedprice * array[i].l_discount);
    }
}
As additional information: I am using CMake and Clang.
Upvotes: 0
Views: 751
Reputation: 53
I was able to find the cause of my poor OpenMP performance. I am running my OpenMP code inside a thread pinned to a core. If I don't pin the thread to a core, then the OpenMP code is fast.
Most likely, the threads that OpenMP creates from the pinned thread inherit its CPU affinity and therefore run on the same core. Consequently, the whole OpenMP code runs on only one core, with many threads competing for it.
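One way to check this is to ask every OpenMP thread which CPUs it is allowed to run on. The following is a minimal, Linux-only sketch, assuming glibc and a compiler that defines _GNU_SOURCE for C++ (g++ and clang++ do on Linux). The helper names are made up for illustration. It relies on the fact that on Linux a new thread inherits the affinity mask of the thread that created it.

```cpp
// Linux-only sketch; build with -fopenmp to get a real thread team.
#include <cassert>
#include <sched.h>   // sched_getaffinity, sched_setaffinity, CPU_* macros

// Number of CPUs the calling thread may run on, or -1 on error.
int allowed_cpus_of_caller() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    return CPU_COUNT(&mask);
}

// Largest affinity mask (in CPUs) seen by any thread of the OpenMP team.
int widest_mask_in_team() {
    int widest = 0;
    #pragma omp parallel reduction(max : widest)
    {
        int mine = allowed_cpus_of_caller();
        if (mine > widest) widest = mine;
    }
    return widest;
}

// Hypothetical helper: restrict the calling thread to the first CPU
// it is currently allowed to use. Returns false if that is not possible.
bool pin_caller_to_first_allowed_cpu() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return false;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(one), &one) == 0;
        }
    }
    return false;
}

// Usage sketch: pin first, then look at the team. With default runtime
// settings (no explicit OpenMP binding), the team is expected to be
// confined to the same single CPU.
void demonstrate_inherited_pinning() {
    if (!pin_caller_to_first_allowed_cpu()) return;  // pinning not permitted here
    assert(widest_mask_in_team() == 1);
}
```

If widest_mask_in_team() returns 1 in the real program, all 48 threads are sharing one core. Possible ways out are to start the parallel region from an unpinned thread, or to widen the affinity mask with sched_setaffinity before the first parallel region.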
Upvotes: 1
Reputation: 5810
Modern CPUs only reach high performance when there is plenty of cached data to reuse. Since you are only streaming linearly through an array, there is no such reuse and you are limited by memory bandwidth. Your cores will indeed run at a small fraction of their full utilization.
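To see whether the loop really is bandwidth-bound, one option is to time a pass and convert it to an effective GB/s figure, then compare that with the memory bandwidth of the machine (for example as measured by the STREAM benchmark). A rough sketch, assuming the struct from the question; the function names are invented for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Lineitem {
    std::int64_t l_quantity;
    std::int64_t l_extendedprice;
    float l_discount;
    unsigned int l_shipdate;
};

// Bytes streamed per second in GB/s (10^9 bytes), assuming every
// element is read from memory exactly once.
double effective_gb_per_s(std::size_t elements, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(elements) * sizeof(Lineitem) / seconds / 1e9;
}

// Runs the filter-and-sum loop once, stores the result in sum_out,
// and returns the elapsed wall-clock time in seconds.
double time_one_pass(const std::vector<Lineitem>& items,
                     unsigned int date1, unsigned int date2,
                     double& sum_out) {
    const std::size_t n = items.size();
    double sum = 0.0;
    auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        const Lineitem& it = items[i];
        if (it.l_shipdate >= date1 && it.l_shipdate < date2 &&
            it.l_discount >= 0.08f && it.l_discount <= 0.1f &&
            it.l_quantity < 24) {
            sum += it.l_extendedprice * it.l_discount;
        }
    }
    auto stop = std::chrono::steady_clock::now();
    sum_out = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

If the resulting figure is close to what the memory system can deliver, adding threads will not help. If it is far below, something else is holding the loop back.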
Things may be even worse: you have an array of structures from which you use certain fields. If there are other fields that you don't use, you do not fully use the cache lines that you load from memory, which cuts the performance by a further factor. Please amend your question by including the data layout of your structure/class.
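If the real struct does have unused fields, a structure-of-arrays layout avoids loading them. Here is a minimal sketch of that idea, assuming the four fields from the question; LineitemColumns and filtered_revenue are hypothetical names. Note that with only these four fields, all of which are read, the layout change alone should not save much bandwidth. It pays off when the table has more columns than the query touches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure of arrays: one contiguous array per field, so a query
// only pulls the columns it actually reads into cache.
struct LineitemColumns {
    std::vector<std::int64_t> l_quantity;
    std::vector<std::int64_t> l_extendedprice;
    std::vector<float> l_discount;
    std::vector<unsigned int> l_shipdate;
};

// Same filter and sum as the loop in the question, on the column layout.
double filtered_revenue(const LineitemColumns& c,
                        unsigned int date1, unsigned int date2) {
    const std::size_t n = c.l_shipdate.size();
    // All columns must describe the same number of rows.
    assert(c.l_quantity.size() == n);
    assert(c.l_extendedprice.size() == n);
    assert(c.l_discount.size() == n);

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        if (c.l_shipdate[i] >= date1 && c.l_shipdate[i] < date2 &&
            c.l_discount[i] >= 0.08f && c.l_discount[i] <= 0.1f &&
            c.l_quantity[i] < 24) {
            sum += c.l_extendedprice[i] * c.l_discount[i];
        }
    }
    return sum;
}
```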
Upvotes: 0