Using rand_r in OpenMP 'for' is slower with 2 threads

Question

The following code performs better with 1 thread than with 2 (using 4 threads gives speed up, though):

#include 
#include 
#include 

int main(int argc, char **argv) {
  int n = atoi(argv[1]);
  int num_threads = atoi(argv[2]);
  omp_set_num_threads(num_threads);

  unsigned int *seeds = malloc(num_threads * sizeof(unsigned int));
  for (int i = 0; i < num_threads; ++i) {
    seeds[i] = 42 + i;
  }

  unsigned long long sum = 0;
  double begin_time = omp_get_wtime();
  #pragma omp parallel
  {
    unsigned int *seedp = &seeds[omp_get_thread_num()];
    #pragma omp for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
      sum += rand_r(seedp);
    }
  }
  double end_time = omp_get_wtime();

  printf("%fs
", end_time - begin_time);
  free(seeds);
  return EXIT_SUCCESS;
}

On my laptop (2 cores, HT enabled) I get the following results:

$ gcc -fopenmp test.c && ./a.out 100000000 1
0.821497s
$ gcc -fopenmp test.c && ./a.out 100000000 2
1.096394s
$ gcc -fopenmp test.c && ./a.out 100000000 3
0.933494s
$ gcc -fopenmp test.c && ./a.out 100000000 4
0.748038s

The problem persists without reduction, drand48_r brings no difference, dynamic scheduling makes things even worse. However, if I replace the body of the loop with something not connected with random, i. e. sum += *seedp + i;, everything works as expected.

Gilles · Accepted Answer

This is textbook example of false sharing. By using an array of seeds upon which each thread take one element, you force the logically private variables to be physically located next to each-other in memory. Therefore, the are all in the same cache line. This means that although no thread tries to modify a some other thread's seed, the cache line itself is modified by each threads at each iteration. And the actual trouble is that the system cannot detect variable's modifications for cache coherency, only cache line modifications. Therefore, at each iteration for each thread, the cache line has been modified by another thread and is no longer valid from a system's point of view. It has to be reloaded from memory (well, most likely from shared L3 cache here), leading to slowing down your code.

Try this one instead (not tested):

#include 
#include 
#include 

int main(int argc, char **argv) {
  int n = atoi(argv[1]);
  int num_threads = atoi(argv[2]);
  omp_set_num_threads(num_threads);

  unsigned long long sum = 0;
  double begin_time = omp_get_wtime();
  #pragma omp parallel
  {
    unsigned int seed = 42 + omp_get_thread_num();
    #pragma omp for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
      sum += rand_r(&seed);
    }
  }
  double end_time = omp_get_wtime();

  printf("%fs
", end_time - begin_time);
  return EXIT_SUCCESS;
}

Using rand_r in OpenMP 'for' is slower with 2 threads

Answers (1)

Related Questions

Using rand_r in OpenMP &#39;for&#39; is slower with 2 threads

Answers (1)

Related Questions

Using rand_r in OpenMP 'for' is slower with 2 threads