I'm experimenting with threads in C, and I'm confused by some results.
I have the following loop:
for(size_t i=0;i<1000000000;i++){
    a++;
}
It increments a global variable a. I've done this with 6 variables, a to f.
First, I incremented the variables consecutively in main:
#include<stdio.h>
#include<pthread.h>
size_t a,b,c,d,e,f;
int main(void){
    for(size_t i=0;i<1000000000;i++){
        a++;
    }
    for(size_t i=0;i<1000000000;i++){
        b++;
    }
    for(size_t i=0;i<1000000000;i++){
        c++;
    }
    for(size_t i=0;i<1000000000;i++){
        d++;
    }
    for(size_t i=0;i<1000000000;i++){
        e++;
    }
    for(size_t i=0;i<1000000000;i++){
        f++;
    }
    size_t abcdef=a+b+c+d+e+f;
    printf("%zu\n",abcdef);
    return 0;
}
Then, when I ran the program under time, I got the following results:
6000000000
real 0m11.450s
user 0m11.446s
sys 0m0.000s
I'd expect the results using pthreads to be quite a bit faster:
#include<stdio.h>
#include<pthread.h>
size_t a,b,c,d,e,f;
void *t1(void *args){
    for(size_t i=0;i<1000000000;i++){
        a++;
    }
    return NULL;
}
void *t2(void *args){
    for(size_t i=0;i<1000000000;i++){
        b++;
    }
    return NULL;
}
void *t3(void *args){
    for(size_t i=0;i<1000000000;i++){
        c++;
    }
    return NULL;
}
void *t4(void *args){
    for(size_t i=0;i<1000000000;i++){
        d++;
    }
    return NULL;
}
void *t5(void *args){
    for(size_t i=0;i<1000000000;i++){
        e++;
    }
    return NULL;
}
void *t6(void *args){
    for(size_t i=0;i<1000000000;i++){
        f++;
    }
    return NULL;
}
int main(void){
    pthread_t p1,p2,p3,p4,p5,p6;
    pthread_create(&p1,NULL,t1,NULL);
    pthread_create(&p2,NULL,t2,NULL);
    pthread_create(&p3,NULL,t3,NULL);
    pthread_create(&p4,NULL,t4,NULL);
    pthread_create(&p5,NULL,t5,NULL);
    pthread_create(&p6,NULL,t6,NULL);
    pthread_join(p1,NULL);
    pthread_join(p2,NULL);
    pthread_join(p3,NULL);
    pthread_join(p4,NULL);
    pthread_join(p5,NULL);
    pthread_join(p6,NULL);
    size_t abcdef=a+b+c+d+e+f;
    printf("%zu\n",abcdef);
    return 0;
}
However, the results were quite unexpected:
6000000000
real 0m14.521s
user 1m26.048s
sys 0m0.014s
Not only is the real time larger, when I'd expect it to be lower, but the user time is over 1 minute, even though I did not wait a minute.
What is happening here? How can I solve it?
Upvotes: 2
Views: 69
Reputation: 106
The issue that you're running into here is due to cache coherency, specifically an effect commonly called false sharing.
In modern processors, the smallest unit of memory that moves between main memory and a core's cache is a full cache line, which on many modern processors is 64 bytes. That means that to increment one of your variables, the core has to hold the whole 64-byte line containing it, even though only 8 bytes of that are modified for the increment. The other 56 bytes are just along for the ride.
However, if any of those other bytes need to be modified by another core, they have to use a cache coherence protocol to ensure that they don't corrupt each other's memory. When a cache line is written to, it will be marked as modified, and each other cache will have to mark it as invalid and reload it to use it again.
When you define your variables in your code as:
size_t a,b,c,d,e,f;
they are typically laid out in memory as one contiguous block of 6 × 8 = 48 bytes, which is less than a full cache line, so they will usually all land in the same line (or be split across two at most). That means that each thread is fighting for that one 64-byte block of memory, and is unable to progress until it has it. This makes the actual execution effectively serial, even though multiple cores might be executing code at the same time.
Here are my results for running the programs: (test is your first code sample, test1 is the pthreads sample)
$ time ./test
6000000000
real 0m22.526s
user 0m22.391s
sys 0m0.000s
$ time ./test1
6000000000
real 0m13.094s
user 1m7.797s
sys 0m0.047s
My pthreads test was actually faster. I suspect that it's due to my CPU using hyperthreading, which runs two threads on the same core; those two threads share the same cache, so there is no contention between them for the line.
I modified the pthreads code to make the globals 64-byte aligned using the GCC/Clang aligned attribute, which forces each one to be in its own cache line.
size_t a __attribute__ ((aligned (64)));
size_t b __attribute__ ((aligned (64)));
size_t c __attribute__ ((aligned (64)));
size_t d __attribute__ ((aligned (64)));
size_t e __attribute__ ((aligned (64)));
size_t f __attribute__ ((aligned (64)));
And here are the results:
$ time ./test2
6000000000
real 0m2.665s
user 0m15.281s
sys 0m0.016s
It's way faster!
Upvotes: 1
Reputation: 1
Not sure, but thread creation, thread termination, and library initialization all take some time when threading is involved, so that could be the reason. You can try printing the time before and after the loop in each thread function, and before and after all the loops in the main thread.
Upvotes: 0