Cauchy

Reputation: 1747

C++ OpenMP: Writing to a matrix inside of for loop slows down the for loop significantly

I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is a reduced example of something I am doing in a more complicated piece of code; it replicates how writing to a matrix significantly slows down the for loop. I am trying to figure out why this happens and whether there is a solution.

#include <cstdint>
#include <vector>
#include <cmath>
#include <omp.h>

// Count the number of bits
inline int bitCount(uint64_t n){

  int count = 0;

  while(n){

    n &= (n-1);
    count++;

  }

  return count;

}


void test(){

  int nthreads = omp_get_max_threads();
  omp_set_dynamic(0);
  omp_set_num_threads(nthreads);

  // In the real code each thread has its own priority queue; here it is reduced to a vector per thread
  std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
  std::vector<uint64_t> vals(100,1);

  # pragma omp parallel for shared(mat,vals)
  for(int i = 0; i < 100000000; i++){
    std::vector<double> &tid_vec = mat[omp_get_thread_num()];
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
      tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
    }
  }

}

This code runs in about 11 seconds. If I comment out the following line:

tid_vec[j] = total_count;

the code runs in about 2 seconds. Is there a reason why writing to the matrix is so expensive in my case?

Upvotes: 3

Views: 179

Answers (1)

Jorge Bellon

Reputation: 3116

Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.

If you comment out the line:

tid_vec[j] = total_count;

the compiler will optimize away all computations whose results are never used. Therefore,

  total_count += bitCount(vals[j]);

is optimized away as well. If the main kernel of your benchmark is eliminated entirely, it makes sense that the program runs much faster.
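
One way to verify this is to make the result of the inner loop observable, so the compiler cannot discard the computation. Below is a minimal sketch of the same loop without the per-element store; the grand_total variable and the reduction clause are my additions, not part of the original code:

#include <cstdint>
#include <cstdio>
#include <vector>
#include <omp.h>

// Same bit-counting kernel as in the question.
inline int bitCount(uint64_t n){
  int count = 0;
  while(n){
    n &= (n-1);
    count++;
  }
  return count;
}

void test_no_store(){
  std::vector<uint64_t> vals(100,1);
  long long grand_total = 0; // consuming the result keeps the computation alive

  #pragma omp parallel for reduction(+:grand_total) shared(vals)
  for(int i = 0; i < 100000000; i++){
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
    }
    grand_total += total_count; // observable side effect, so the loop body is not removed
  }

  std::printf("grand_total = %lld\n", grand_total);
}

With this version the timing comparison is fair: the work is still done, only the writes to the matrix are gone.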

On the other hand, I would not implement a bit count function myself but rather rely on functionality that is already provided to you. For example, GCC's builtin functions include __builtin_popcountll, which does exactly what you are trying to do for 64-bit integers.
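
With the builtin, the whole bitCount function can collapse to a single popcnt instruction when the target CPU supports it. A sketch, assuming GCC or Clang (other compilers need their own intrinsic, or std::popcount if C++20 is available):

#include <cstdint>

// Counts the set bits of a 64-bit value using the compiler intrinsic.
inline int bitCount(uint64_t n){
  return __builtin_popcountll(n); // operates on unsigned long long, covering all 64 bits
}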

As a bonus: it is much better for each thread to work on private data than on different elements of a shared array. This improves locality (especially important when access to memory is not uniform, i.e. NUMA) and may reduce contention.

#pragma omp parallel shared(mat,vals)
{
  // Each thread fills its own private buffer instead of a row of the shared matrix
  std::vector<double> local_vec(1000,-INFINITY);
  #pragma omp for
  for(int i = 0; i < 100000000; i++) {
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
      local_vec[j] = total_count;
    }
  }
  // Copy the private buffer back to this thread's row of the shared matrix
  mat[omp_get_thread_num()] = local_vec;
}

Upvotes: 3
