Reputation: 518
I was trying to see the effect of time slicing, and how it can consume a significant amount of time. Specifically, I divided a fixed amount of work among a number of threads and measured the effect.
I have a two-core processor, so two threads can run in parallel. I wanted to compare a workload w done by 2 threads against the same workload done by t threads, with each thread doing w/t of the work, to see how much of a role time slicing plays.
Since time slicing is itself a time-consuming process, I expected that doing the same work with t threads would take longer than doing it with 2 threads.
However, that is not what I found. I tried t=10, and it was still faster than the 2-thread version. For example, to do 10,000,000 iterations with the 2-thread version, each thread does 5,000,000 iterations, for a total of 10,000,000. With the 10-thread version, each thread does 1,000,000 iterations, again totalling 10,000,000.
I was expecting the 10-thread version to consume more time, but it doesn't. Is there a bug in my code? It looks fine to me.
Any suggestions?
Upvotes: 5
Views: 671
Reputation: 62439
You are doing 10,000,000 (10 million) x 1000 iterations sequentially and 5,000,000 (5 million) x 1000 iterations in each thread in the parallel version. In my experience that is more than enough work to make the thread startup overhead insignificant. The results look correct to me.
With 2 cores and 2 threads there is no time slicing involved (at least between the two worker threads): the scheduler is smart enough to put the threads on separate cores and keep them there.
In order to see some degradation you need to move some memory around through the caches, so that each context switch actually penalizes performance by causing some data to be evicted from the caches. Here are the running times I'm getting:
./a.out 2 500000000
Number of threads = 2
Number of iterations in each thread = 250000000
Total time taken = 5.931148
./a.out 1000 500000000
Number of threads = 1000
Number of iterations in each thread = 500000
Total time taken = 6.563666
./a.out 2000 500000000
Number of threads = 2000
Number of iterations in each thread = 250000
Total time taken = 7.087449
And here's the code. I'm basically partitioning a large array among the given threads and squaring every item in the array:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>
#include <sys/time.h>   /* for gettimeofday() */

long *array;
int length;
int threads;

/* Square every element in this thread's slice of the array. */
void *tfunc(void *arg) {
    int n = (int)(intptr_t)arg;   /* thread index, passed by value */
    int start = n * (length / threads);
    int end = (n + 1) * (length / threads);
    int i;
    for (i = start; i < end; i++) {
        array[i] = array[i] * array[i];
    }
    return 0;
}

double timestamp() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + tp.tv_usec / 1000000.;
}
int main(int argc, char *argv[]) {
    int numberOfThreads = atoi(argv[1]);
    int numberOfIterations = atoi(argv[2]);
    int i;

    printf("Number of threads = %d\n", numberOfThreads);
    printf("Number of iterations in each thread = %d\n",
           numberOfIterations / numberOfThreads);

    pthread_t workerThreads[numberOfThreads];
    int result[numberOfThreads];

    array = malloc(numberOfIterations * sizeof(long));
    length = numberOfIterations;
    threads = numberOfThreads;

    double timeTaken = timestamp();
    for (i = 0; i < numberOfThreads; i++) {
        result[i] = pthread_create(workerThreads + i, NULL, tfunc,
                                   (void *)(intptr_t)i);
    }
    for (i = 0; i < numberOfThreads; i++) {
        pthread_join(workerThreads[i], NULL);
    }
    timeTaken = timestamp() - timeTaken;

    printf("Total time taken = %f\n", timeTaken);
    free(array);
    return 0;
}
Upvotes: 1
Reputation: 24847
For an app to demonstrate a significant, easily-measurable slowdown with many more threads than processors, you have to work at it:
1) The threads must be CPU-intensive, i.e. they must not block on I/O or on each other. If you are using a simple counting loop (as it sounds like you are), that requirement is met.
2) You have to arrange for each thread to work on data large enough that the L1 cache needs significant refilling after a context switch. If you just increment one integer, this eviction will not happen and the context-switch overhead will be too small (compared with the interval between timer-driven scheduling runs) to demonstrate easily.
Windows example data, minimal cache-flushing, i7, 4/8 cores:
8 tests,
400 tasks,
counting to 10000000,
using 8 threads:
Ticks: 2199
Ticks: 2184
Ticks: 2215
Ticks: 2153
Ticks: 2200
Ticks: 2215
Ticks: 2200
Ticks: 2230
Average: 2199 ms
8 tests,
400 tasks,
counting to 10000000,
using 32 threads:
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2138
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2137
Average: 2137 ms
8 tests,
400 tasks,
counting to 10000000,
using 128 threads:
Ticks: 2168
Ticks: 2106
Ticks: 2184
Ticks: 2106
Ticks: 2137
Ticks: 2122
Ticks: 2106
Ticks: 2137
Average: 2133 ms
8 tests,
400 tasks,
counting to 10000000,
using 400 threads:
Ticks: 2137
Ticks: 2153
Ticks: 2059
Ticks: 2153
Ticks: 2168
Ticks: 2122
Ticks: 2168
Ticks: 2138
Average: 2137 ms
Upvotes: 1
Reputation:
If you have multiple logical cores, your threads will execute in parallel.
To test your hypothesis, you need to pin them to a single logical core.
Upvotes: 0
Reputation: 340188
How many CPU cores does your machine have? The thing about CPU-bound threads is that even though there is overhead for setting up and scheduling the threads that doesn't exist with a single thread, if those threads can actually execute at the same time (instead of just appearing to execute at the same time), they can still produce a gain greater than the overhead costs.
Upvotes: 0