Reputation: 391
I was trying to run a function on multiple pthreads in order to increase efficiency and reduce runtime. This function performs a lot of matrix calculations and print statements. However, when I ran tests to see the performance improvement, the single-threaded code ran faster.
My tests went as follows:
- For the single-threaded version: run a for loop from 1 to 1000 that calls the function (sketched just below).
- For the multi-threaded version: spawn 100 pthreads, push 1000 items onto a queue, and have the threads wait on pthread_cond_wait and run the function until the queue is empty.
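Roughly, the single-threaded baseline is just (a sketch; runtest() stands in for the matrix/print function, as in the code below):

    for (int z = 0; z < 1000; z++)
        runtest();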
Here is my code for the pthreads (single-threaded is just a for loop instead):
#include <iostream>
#include <string>
#include <pthread.h>
#include <queue>

using namespace std;

#define NUM_THREADS 100

int main();
queue<int> testQueue;
void *playQueue(void *arg);
void matrix_exponential_test01();
void matrix_exponential_test02();
pthread_mutex_t queueLock;
pthread_cond_t queue_cv;

int main()
{
    pthread_t threads[NUM_THREADS];
    pthread_mutex_init(&queueLock, NULL);
    pthread_cond_init(&queue_cv, NULL);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_create(&threads[i], NULL, playQueue, (void*)NULL);
    }

    pthread_mutex_lock(&queueLock);
    for (int z = 0; z < 1000; z++)
    {
        testQueue.push(1);
        pthread_cond_signal(&queue_cv);
    }
    pthread_mutex_unlock(&queueLock);

    pthread_mutex_destroy(&queueLock);
    pthread_cond_destroy(&queue_cv);
    /* pthread_cancel(NULL); */
    return 0;
}

void *playQueue(void *arg)
{
    bool accept;
    while (true)
    {
        pthread_cond_wait(&queue_cv, &queueLock);
        accept = false;
        if (!testQueue.empty())
        {
            testQueue.pop();
            accept = true;
        }
        pthread_mutex_unlock(&queueLock);
        if (accept)
        {
            runtest();
        }
    }
    pthread_exit(NULL);
}
My intuition tells me that the multi-threaded version should run faster, but it doesn't. Is there a reason, or is my code faulty? I am using C++ on Windows, and had to download a library to use pthreads.
Upvotes: 1
Views: 1241
Reputation:
If runtest() is CPU-bound -- that is, doesn't do anything which might block on I/O or the like -- then there's not much point starting 100 threads unless you have 100 CPUs/cores! [Edit: I now notice that runtest() does some print statements... file I/O probably won't block... so won't release the CPU.]
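For example, the pool size could be derived from the machine rather than hard-coded at 100 -- a sketch using standard C++11 (std::thread::hardware_concurrency() is not part of the question's code):

    #include <thread>

    // Pick a worker count from the hardware instead of a fixed 100.
    // hardware_concurrency() may return 0 if the count cannot be determined.
    unsigned pickPoolSize()
    {
        unsigned cores = std::thread::hardware_concurrency();
        return (cores > 0) ? cores : 4;   // fall back to a small fixed pool
    }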
The code as currently shown holds the mutex while filling the queue, so nothing will start until the queue is full. By the time the filling loop has finished signalling 1000 times, any threads that have reached the pthread_cond_wait() will hopefully all have been started -- so all 100 will be waiting on the mutex.
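For contrast, a producer loop that locks and unlocks per item would let the workers start draining the queue straight away (a sketch reusing the question's names, not the original code):

    for (int z = 0; z < 1000; z++)
    {
        pthread_mutex_lock(&queueLock);
        testQueue.push(1);
        pthread_cond_signal(&queue_cv);    // wake one waiting worker for this item
        pthread_mutex_unlock(&queueLock);  // release so that worker can actually take it
    }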
As currently shown, the waiting in playQueue() is broken. It should be something along the lines of:
    pthread_mutex_lock(&queueLock);              // there is no pthread_mutex_wait(); lock the mutex here
    while (testQueue.empty() && !no_more_work)   // no_more_work: an assumed end-of-work flag (std::queue has no eof())
        pthread_cond_wait(&queue_cv, &queueLock);
    if (no_more_work && testQueue.empty())
        val = 0;                                 // nothing left to do
    else
    {
        val = testQueue.front();                 // std::queue::pop() returns void, so read front() first
        testQueue.pop();
    }
    pthread_mutex_unlock(&queueLock);
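Once main() has queued everything, the assumed no_more_work flag above could be set and all waiters woken -- a sketch, not from the question's code:

    pthread_mutex_lock(&queueLock);
    no_more_work = true;                 // tell every worker that nothing more is coming
    pthread_cond_broadcast(&queue_cv);   // wake all waiters so each can see the flag and exit
    pthread_mutex_unlock(&queueLock);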
But, even when this is all sorted out, there is no guarantee you will see an improvement in performance unless runtest() does a serious amount of work. [Edit: I now notice that it does "a lot of matrix calculations", which sounds like it could be plenty of work.]
One small suggestion: starting the worker threads and filling the queue could be overlapped, by (for instance) giving one worker the job of starting all the others, or starting a separate thread to fill the queue.
Without knowing more about the problem: if the work can be statically divided across the worker threads -- for instance, giving the first thread items 0..9, the second thread items 10..19, and so on -- then each worker can ignore all the others, which reduces the number of synchronisation operations.
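A minimal sketch of that static split (playStatic() and WORK_PER_THREAD are illustrative names, not from the question, and runtest()'s no-argument signature is assumed):

    #include <pthread.h>
    #include <stdint.h>

    #define NUM_THREADS     100
    #define WORK_PER_THREAD 10            // 1000 items / 100 threads

    extern void runtest();                // the question's matrix/print routine (assumed signature)

    // Each worker handles its own fixed slice -- no queue, no mutex, no condition variable.
    void *playStatic(void *arg)
    {
        int id = (int)(intptr_t)arg;
        for (int i = id * WORK_PER_THREAD; i < (id + 1) * WORK_PER_THREAD; i++)
            runtest();
        return NULL;
    }

    // In main(): pthread_create(&threads[i], NULL, playStatic, (void *)(intptr_t)i);
    //            then pthread_join() each thread before timing stops.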
Upvotes: 4
Reputation: 40659
In addition to the other good answers, you said your runtest() function does I/O.
So you could well be I/O bound, in which case all your threads have to wait in line like everybody else to empty out their buffers.
Upvotes: 1
Reputation: 429
First, your code is written in a way that only one thread will run at any time (your mutex is locked the whole time a thread is doing work). So at best you can expect your code to be as fast as the single-threaded version.

Also, all the threads are reading and writing the same memory each time, which forces your CPU cores to synchronize their caches -- in effect putting more load on the bus than a single thread would. Since you are not doing any computationally expensive work, memory bandwidth is likely your actual bottleneck, and the bus load added by cache synchronization slows your program down. Take a look at http://en.wikipedia.org/wiki/False_sharing for more information.
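Purely as an illustration (none of these names are from the question), two hot fields that share a cache line will bounce between cores, while padding them onto separate lines avoids that:

    // Both counters typically land in the same 64-byte cache line, so writes from
    // different cores keep invalidating each other's copy (false sharing).
    struct Unpadded
    {
        long counterA;   // written by one thread
        long counterB;   // written by another thread
    };

    // alignas(64) gives each counter its own cache line, so the cores stop fighting over it.
    struct Padded
    {
        alignas(64) long counterA;
        alignas(64) long counterB;
    };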
Upvotes: 5