user3692545

Reputation: 41

Pthread program runs slower as thread increases

I'm a beginner in parallel programming and I tried to write a parallel program with the pthread library. I ran the program on an 8-processor computer. The problem is that when I increase NumProcs, each thread slows down even though their tasks are always the same. Can someone help me figure out what is happening?

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>

#define MAX_NUMP 16
using namespace std;

int NumProcs;

pthread_mutex_t   SyncLock; /* mutex */
pthread_cond_t    SyncCV; /* condition variable */
int               SyncCount; /* number of processors at the barrier so far */

pthread_mutex_t   ThreadLock; /* mutex */

// used only in solaris. use clock_gettime in linux
//hrtime_t          StartTime;
//hrtime_t          EndTime;  

struct timespec StartTime;
struct timespec EndTime;

void Barrier()
{
  int ret;

  pthread_mutex_lock(&SyncLock); /* Get the thread lock */
  SyncCount++;
  if(SyncCount == NumProcs) {
    ret = pthread_cond_broadcast(&SyncCV);
    assert(ret == 0);
  } else {
    ret = pthread_cond_wait(&SyncCV, &SyncLock); 
    assert(ret == 0);
  }
  pthread_mutex_unlock(&SyncLock);
}


/* The function which is called once the thread is allocated */
void* ThreadLoop(void* tmp)
{
  /* each thread has a private version of local variables */
  long threadId = (long) tmp; 
  int ret;
  clock_t startTime, endTime;
  int count=0;
  /* ********************** Thread Synchronization*********************** */
  Barrier();

  /* ********************** Execute Job ********************************* */
  startTime = clock();
  for(int i=0;i<65536;i++)
    for(int j=0;j<1024;j++)
        count++;
  endTime = clock();
  printf("threadid:%ld, time:%ld\n", threadId, (long)(endTime - startTime));
  return NULL;
}


int main(int argc, char** argv)
{
  pthread_t*     threads;
  pthread_attr_t attr;
  int            ret;
  int            dx;

  if(argc != 2) {
    fprintf(stderr, "USAGE: %s <numProcesors>\n", argv[0]);
    exit(-1);
  }
  assert(argc == 2);
  NumProcs = atoi(argv[1]);
  assert(NumProcs > 0 && NumProcs <= MAX_NUMP);

  /* Initialize array of thread structures */
  threads = (pthread_t *) malloc(sizeof(pthread_t) * NumProcs);
  assert(threads != NULL);

  /* Initialize thread attribute */
  pthread_attr_init(&attr);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM); // sys manages contention

  /* Initialize mutexs */
  ret = pthread_mutex_init(&SyncLock, NULL);
  assert(ret == 0);
  ret = pthread_mutex_init(&ThreadLock, NULL);
  assert(ret == 0);

  /* Init condition variable */
  ret = pthread_cond_init(&SyncCV, NULL);
  assert(ret == 0);
  SyncCount = 0;

  /* get high resolution timer, timer is expressed in nanoseconds, relative
   * to some arbitrary time.. so to get delta time must call gethrtime at
   * the end of operation and subtract the two times.
   */
  //StartTime = gethrtime();
  ret = clock_gettime(CLOCK_MONOTONIC, &StartTime);

  for(dx=0; dx < NumProcs; dx++) {
    /* ************************************************************
     * pthread_create takes 4 parameters
     *  p1: threads(output)
     *  p2: thread attribute
     *  p3: start routine, where new thread begins
     *  p4: arguments to the thread
     * ************************************************************ */
    ret = pthread_create(&threads[dx], &attr, ThreadLoop, (void*)(long) dx);
    assert(ret == 0);

  }

  /* Wait for each of the threads to terminate */
  for(dx=0; dx < NumProcs; dx++) {
    ret = pthread_join(threads[dx], NULL);
    assert(ret == 0);
  }

  //EndTime = gethrtime();
  ret = clock_gettime(CLOCK_MONOTONIC, &EndTime);

  printf("Time = %ld nanoseconds\n", EndTime.tv_nsec - StartTime.tv_nsec);

  pthread_mutex_destroy(&ThreadLock);

  pthread_mutex_destroy(&SyncLock);
  pthread_cond_destroy(&SyncCV);
  pthread_attr_destroy(&attr);

  return 0;
}

Upvotes: 4

Views: 943

Answers (2)

quantdev

Reputation: 23793

Your observation is expected.

The main factors that usually affect this situation (workers spinning on local computation) are:

  • The ratio nb_threads / nb_available_machine_cores
  • The affinity of each thread

The optimal scenario here is a ratio of 1, with each thread having a unique affinity with one of the cores.

The idea is to maximize each core's throughput. You can do that by having one and only one thread running on each core. If you increase the number of threads (ratio > 1), several threads will share the same core, forcing the kernel (through the task scheduler) to switch between them. This is what you were observing.

Each time the kernel performs such a switch, you pay for a context switch, and that can become a noticeable overhead.
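For reference, a minimal sketch of checking that ratio at runtime, assuming Linux and sysconf (the variable name cores is only illustrative):

/* In main(), after parsing NumProcs: compare the request against the
 * number of cores actually online (Linux: sysconf, needs <unistd.h>). */
long cores = sysconf(_SC_NPROCESSORS_ONLN);
if (NumProcs > cores)
  printf("ratio %d/%ld > 1: threads will share cores and context-switch\n",
         NumProcs, cores);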

Note:

You can use pthread_setaffinity_np to set the affinity of your threads.
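A minimal sketch of pinning each worker to its own core, assuming Linux's GNU-specific pthread_setaffinity_np (the PinThreadToCore helper name is mine, not from the question):

#define _GNU_SOURCE           /* for pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>

static void PinThreadToCore(pthread_t thread, int core)
{
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);     /* thread may run only on this core */
  pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
}

/* In main(), right after each pthread_create() call:
 *   PinThreadToCore(threads[dx], dx % 8);   // 8 = cores on the test machine
 */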

Upvotes: 1

Lasse Reinhold

Reputation: 166

If you are running this in release mode (with the -O3 compiler flag) then there are two things wrong with ThreadLoop():

1) There is never any external usage of the 'count' result, so the compiler will omit computing it because it has no visible effect.

2) Even if there had been external usage of 'count', the compiler would compute the result at compile time and simply emit the value directly.

You can see all this if you disassemble the binary.

You can declare 'count' as 'volatile int count' to bypass both problems, or compile with the -O1 flag, or do both.
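A minimal sketch of that workaround inside ThreadLoop() (same loop as in the question, only the qualifier changes):

volatile int count = 0;       /* volatile forces the compiler to keep every increment */

startTime = clock();
for (int i = 0; i < 65536; i++)
  for (int j = 0; j < 1024; j++)
    count++;                  /* no longer folded away or precomputed at -O3 */
endTime = clock();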

The loop should scale pretty linearly with the number of threads because there is no memory contention. By the way, you should increase the loop iteration count because the measured duration could be close to the measurement noise...

Upvotes: 0
