Code with threads taking longer than without

Question

I'm experimenting with threads in C, and I'm confused by some results.

I have the following loop:

for(size_t i=0;i<1000000000;i++){
  a++;
}

It increments a global variable a. I've done this with 6 variables, a to f.

First, I incremented the variables consecutively in main:

#include <stdio.h>
#include <stddef.h>

size_t a,b,c,d,e,f;

int main(void){
  for(size_t i=0;i<1000000000;i++){
    a++;
  }
  for(size_t i=0;i<1000000000;i++){
    b++;
  }
  for(size_t i=0;i<1000000000;i++){
    c++;
  }
  for(size_t i=0;i<1000000000;i++){
    d++;
  }
  for(size_t i=0;i<1000000000;i++){
    e++;
  }
  for(size_t i=0;i<1000000000;i++){
    f++;
  }
  size_t abcdef=a+b+c+d+e+f;
  printf("%zu\n",abcdef);
  return 0;
}

Then, when testing the program with time, the following results were given:

6000000000

real    0m11.450s
user    0m11.446s
sys     0m0.000s

I'd expect the results using pthreads to be quite a bit faster:

#include <stdio.h>
#include <pthread.h>

size_t a,b,c,d,e,f;

void *t1(void *args){
  for(size_t i=0;i<1000000000;i++){
    a++;
  }
  return NULL;
}

void *t2(void *args){
  for(size_t i=0;i<1000000000;i++){
    b++;
  }
  return NULL;
}

void *t3(void *args){
  for(size_t i=0;i<1000000000;i++){
    c++;
  }
  return NULL;
}

void *t4(void *args){
  for(size_t i=0;i<1000000000;i++){
    d++;
  }
  return NULL;
}

void *t5(void *args){
  for(size_t i=0;i<1000000000;i++){
    e++;
  }
  return NULL;
}

void *t6(void *args){
  for(size_t i=0;i<1000000000;i++){
    f++;
  }
  return NULL;
}

int main(void){
  pthread_t p1,p2,p3,p4,p5,p6;
  pthread_create(&p1,NULL,t1,NULL);
  pthread_create(&p2,NULL,t2,NULL);
  pthread_create(&p3,NULL,t3,NULL);
  pthread_create(&p4,NULL,t4,NULL);
  pthread_create(&p5,NULL,t5,NULL);
  pthread_create(&p6,NULL,t6,NULL);
  pthread_join(p1,NULL);
  pthread_join(p2,NULL);
  pthread_join(p3,NULL);
  pthread_join(p4,NULL);
  pthread_join(p5,NULL);
  pthread_join(p6,NULL);
  size_t abcdef=a+b+c+d+e+f;
  printf("%zu\n",abcdef);
  return 0;
}

However, the results were quite unexpected:

6000000000

real    0m14.521s
user    1m26.048s
sys     0m0.014s

Not only is the real time higher, when I expected it to be lower, but the user time is over a minute, even though I did not wait anywhere near a minute.

What is happening here? How can I solve it?

Marthinwurer · Accepted Answer

The issue that you're running into here is due to cache coherency.

In modern processors, the actual minimum amount of memory that a single core can access at a time is a full cache line, which on many modern processors is 64 bytes. That means with every increment of each variable, 64 bytes are read, and 8 bytes of that are modified for the increment. The other 56 bytes are just along for the ride.
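The 64-byte figure is typical rather than guaranteed. As a rough sketch, the line size can be queried at run time on Linux with glibc, which provides `_SC_LEVEL1_DCACHE_LINESIZE` as an extension to `sysconf`. The helper name `detect_cache_line_size` and the 64-byte fallback are my own assumptions for illustration:

```c
#include <stddef.h>
#include <unistd.h>

/* Assumed fallback, taken from the 64-byte figure mentioned above. */
#define FALLBACK_CACHE_LINE_SIZE 64

/* Hypothetical helper: report the L1 data cache line size in bytes. */
size_t detect_cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; it may return 0 or -1 when the value is unknown. */
    long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (reported > 0) {
        return (size_t)reported;
    }
#endif
    /* Not available on this platform, so fall back to the assumed size. */
    return FALLBACK_CACHE_LINE_SIZE;
}
```

On systems where the query is unavailable, the function simply returns the assumed 64 bytes, so treat the result as a hint.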

However, if any of those other bytes need to be modified by another core, the cores have to use a cache coherence protocol to ensure that they don't corrupt each other's memory. When a cache line is written to, it is marked as modified, and every other cache has to mark its copy as invalid and reload it before using it again.

When you define your variables in your code as:

size_t a,b,c,d,e,f;

they are all lined up in memory as one contiguous block. With 8-byte variables that is 48 bytes, less than a full cache line, so they most likely all sit in the same line. That means each thread is fighting for that one 64-byte block of memory and cannot make progress until it owns it. The execution is effectively serial, even though multiple cores are running code at the same time. This pattern is commonly called false sharing.
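To check whether two variables really share a line, one option is to compare their addresses divided by the line size. This is only a sketch: the helper names are hypothetical, and the 64-byte line size is the assumption used throughout this answer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; other CPUs may differ. */
#define CACHE_LINE_SIZE 64u

/* Index of the cache line that an address falls into. */
uintptr_t cache_line_index(const void *p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* True when both addresses fall inside the same cache line. */
bool same_cache_line(const void *p, const void *q) {
    return cache_line_index(p) == cache_line_index(q);
}
```

For the globals in the question, `same_cache_line(&a, &f)` would probably be true on a typical build, though the compiler and linker are free to place them differently, so it is worth checking on your own binary.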

Here are my results for running the programs: (test is your first code sample, test1 is the pthreads sample)

$ time ./test
6000000000

real    0m22.526s
user    0m22.391s
sys     0m0.000s

$ time ./test1
6000000000

real    0m13.094s
user    1m7.797s
sys     0m0.047s

My pthreads test was actually faster. I suspect that's because my CPU uses hyperthreading, which runs two threads on the same core. Those two threads share that core's cache, so there is no contention between them for the cache line.

I modified the pthreads code to make the globals 64-byte aligned using a compiler attribute (GCC/Clang syntax), which forces each one into its own cache line.

size_t a __attribute__ ((aligned (64)));
size_t b __attribute__ ((aligned (64)));
size_t c __attribute__ ((aligned (64)));
size_t d __attribute__ ((aligned (64)));
size_t e __attribute__ ((aligned (64)));
size_t f __attribute__ ((aligned (64)));
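If you'd rather not depend on a compiler-specific attribute, the same idea can be sketched with C11's `_Alignas`, which also lets one worker function replace t1 to t6. The names `padded_counter`, `increment_counter` and `run_padded_counters` are made up for this example, and the 64-byte line size is still an assumption:

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */
#define NUM_COUNTERS 6

/* Aligning the member to a full line also pads the struct to a full line,
   so neighbouring array elements never share a cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE_SIZE) size_t value;
};

struct padded_counter counters[NUM_COUNTERS];
static size_t iterations_per_thread;

/* Worker: increments only the counter it was handed. */
static void *increment_counter(void *arg) {
    struct padded_counter *counter = arg;
    for (size_t i = 0; i < iterations_per_thread; i++) {
        counter->value++;
    }
    return NULL;
}

/* Runs one thread per counter and returns the sum of all counters,
   or 0 if a thread could not be started. */
size_t run_padded_counters(size_t iterations) {
    pthread_t threads[NUM_COUNTERS];
    int started = 0;
    size_t total = 0;

    iterations_per_thread = iterations;
    for (int i = 0; i < NUM_COUNTERS; i++) {
        counters[i].value = 0;
    }
    while (started < NUM_COUNTERS &&
           pthread_create(&threads[started], NULL,
                          increment_counter, &counters[started]) == 0) {
        started++;
    }
    /* Join whatever was started, even if a later create failed. */
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    if (started < NUM_COUNTERS) {
        return 0;
    }
    for (int i = 0; i < NUM_COUNTERS; i++) {
        total += counters[i].value;
    }
    return total;
}
```

Calling `run_padded_counters(1000000000)` from main and printing the result with `%zu` should give the same sum as the original program. Build with `-pthread`. I haven't timed this variant, so measure it yourself before relying on it.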

And here are the results:

$ time ./test2
6000000000

real    0m2.665s
user    0m15.281s
sys     0m0.016s

It's way faster!
