Luke
Luke

Reputation: 301

Why the average speed of n threads is not as fast as one single thread in C?

I wrote a program with 2 threads doing the same thing but I found the throughput of each threads is slower than if I only spawn one thread. Then I write this simple test to see if that's my problem or it's because of the system.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>


/*
 * Function: run_add
 * -----------------------
 * Do addition operation for iteration ^ 3 times
 *
 * returns: void
 */
void *run_add(void *ptr) {
  clock_t t1, t2;
  t1 = clock();

  int sum = 0;
  int i = 0, j = 0, k = 0;
  int iteration = 1000;
  long total = iteration * iteration * iteration;
  for (i = 0; i < iteration; i++) {
    for (j = 0; j < iteration; j++) {
      for (k = 0; k < iteration; k++) {
        sum++;
      }
    }
  }

  t2 = clock();
  float diff = ((float)(t2 - t1) / 1000000.0F );
  printf("thread id = %d\n", (int)(pthread_self()));
  printf("Total addtions: %ld\n", total);
  printf("Total time: %f second\n", diff);
  printf("Addition per second: %f\n", total / diff);
  printf("\n");

  return NULL;
}


void run_test(int num_thread) {
  pthread_t pth_arr[num_thread];
  int i = 0;
  for (i = 0; i < num_thread; i++) {
    pthread_create(&pth_arr[i], NULL, run_add, NULL);
  }

  for (i = 0; i < num_thread; i++) {
    pthread_join(pth_arr[i], NULL);
  }
}

int main() {
  int num_thread = 5;
  int i = 0;
  for (i = 1; i < num_thread; i++) {
    printf("Running SUM with %d threads. \n\n", i);
    run_test(i);
  }
  return 0;
}

The result still shows the average speed of n threads is slower than one single thread. The more threads I have, the slower each one is.

Here's the result:

Running SUM with 1 threads.

thread id = 528384, Total addtions: 1000000000, Total time: 1.441257 second, Addition per second: 693838784.000000

Running SUM with 2 threads.

thread id = 528384, Total addtions: 1000000000, Total time: 2.970870 second, Addition per second: 336601728.000000

thread id = 1064960, Total addtions: 1000000000, Total time: 2.972992 second, Addition per second: 336361504.000000

Running SUM with 3 threads.

thread id = 1064960, Total addtions: 1000000000, Total time: 4.434701 second, Addition per second: 225494352.000000

thread id = 1601536, Total addtions: 1000000000, Total time: 4.449250 second, Addition per second: 224756976.000000

thread id = 528384, Total addtions: 1000000000, Total time: 4.454826 second, Addition per second: 224475664.000000

Running SUM with 4 threads.

thread id = 528384, Total addtions: 1000000000, Total time: 6.261967 second, Addition per second: 159694224.000000

thread id = 1064960, Total addtions: 1000000000, Total time: 6.293107 second, Addition per second: 158904016.000000

thread id = 2138112, Total addtions: 1000000000, Total time: 6.295047 second, Addition per second: 158855056.000000

thread id = 1601536, Total addtions: 1000000000, Total time: 6.306261 second, Addition per second: 158572560.000000

I have a 4-core CPU and my system monitor shows each time I ran n threads, n CPU cores are 100% utilized. Is it true that n threads(<= my CPU cores) are supposed to run n times as fast as one thread? Why it is not the case here?

Upvotes: 0

Views: 117

Answers (2)

David C. Rankin
David C. Rankin

Reputation: 84609

While you have your answer regarding why the multi-threaded performance seemed worse than single-thread, there are several things you can do to clean up the logic of your program and make it work like it appears you intended it to.

First, if you were keeping track of the relative wall-time that passed and the time reported by your diff of the clock() times, you would have noticed the time reported was approximately a (n-proccessor core) multiple of the actual wall-time. That was explained in the other answer.

For relative per-core performance timing, the use of clock() is fine. You are getting only an approximation of wall-time, but for looking at a relative additions per-second, that provides a clean per-core look at performance.

While you have correctly used a divisor of 1000000 for diff, time.h provides a convenient define for you. POSIX requires that CLOCKS_PER_SEC equals 1000000 independent of the actual resolution. That constant is provided in time.h.

Next, you should also notice that your output per-core wasn't reported until all threads were joined making reporting totals in run_add somewhat pointless. You can output thread_id, etc. from the individual threads for convenience, but the timing information should be computed back in the calling function after all threads have been joined. That will clean up the logic of your run_add significantly. Further, if you want to be able to vary the number of iterations, you should consider passing that value through ptr. e.g.:

/*
 * Function: run_add
 * -----------------------
 * Do addition operation for iteration ^ 3 times
 *
 * returns: void
 */
void *run_add (void *ptr)
{
    int i = 0, j = 0, k = 0, iteration = *(int *)ptr;
    unsigned long sum = 0;

    for (i = 0; i < iteration; i++)
        for (j = 0; j < iteration; j++)
            for (k = 0; k < iteration; k++)
                sum++;

    printf ("  thread id  = %lu\n", (long unsigned) (pthread_self ()));
    printf ("  iterations = %lu\n\n", sum);

    return NULL;
}

run_test is relatively unchanged, with the bulk of the calculation changes being those moved from run_add to main and being scaled to account for the number of cores utilized. The following is a rewrite of main allowing the user to specify the number of cores to use as the first argument (using all-cores by default) and the base for your cubed number of iterations as the second argument (1000 by default):

int main (int argc, char **argv) {

    int nproc = sysconf (_SC_NPROCESSORS_ONLN), /* number of core available */
        num_thread = argc > 1 ? atoi (argv[1]) : nproc,
        iter = argc > 2 ? atoi (argv[2]) : 1000;
    unsigned long subtotal = iter * iter * iter,
        total = subtotal * num_thread;
    double diff = 0.0, t1 = 0.0, t2 = 0.0;

    if (num_thread > nproc) num_thread = nproc;
    printf ("\nrunning sum with %d threads.\n\n", num_thread);

    t1 = clock ();
    run_test (num_thread, &iter);
    t2 = clock ();
    diff = (double)((t2 - t1) / CLOCKS_PER_SEC / num_thread);

    printf ("----------------\nTotal time: %lf second\n", diff);
    printf ("Total addtions: %lu\n", total);
    printf ("Additions per-second: %lf\n\n", total / diff);

    return 0;
}

Putting all the pieces together, you could write a working example as follows. Make sure you disable optimizations to prevent your compiler from optimizing out your loops for sum, etc...

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/*
 * Function: run_add
 * -----------------------
 * Do addition operation for iteration ^ 3 times
 *
 * returns: void
 */
void *run_add (void *ptr)
{
    int i = 0, j = 0, k = 0, iteration = *(int *)ptr;
    unsigned long sum = 0;

    for (i = 0; i < iteration; i++)
        for (j = 0; j < iteration; j++)
            for (k = 0; k < iteration; k++)
                sum++;

    printf ("  thread id  = %lu\n", (long unsigned) (pthread_self ()));
    printf ("  iterations = %lu\n\n", sum);

    return NULL;
}

void run_test (int num_thread, int *it)
{
    pthread_t pth_arr[num_thread];
    int i = 0;

    for (i = 0; i < num_thread; i++)
        pthread_create (&pth_arr[i], NULL, run_add, it);

    for (i = 0; i < num_thread; i++)
        pthread_join (pth_arr[i], NULL);
}

int main (int argc, char **argv) {

    int nproc = sysconf (_SC_NPROCESSORS_ONLN),
        num_thread = argc > 1 ? atoi (argv[1]) : nproc,
        iter = argc > 2 ? atoi (argv[2]) : 1000;
    unsigned long subtotal = iter * iter * iter,
        total = subtotal * num_thread;
    double diff = 0.0, t1 = 0.0, t2 = 0.0;

    if (num_thread > nproc) num_thread = nproc;
    printf ("\nrunning sum with %d threads.\n\n", num_thread);

    t1 = clock ();
    run_test (num_thread, &iter);
    t2 = clock ();
    diff = (double)((t2 - t1) / CLOCKS_PER_SEC / num_thread);

    printf ("----------------\nTotal time: %lf second\n", diff);
    printf ("Total addtions: %lu\n", total);
    printf ("Additions per-second: %lf\n\n", total / diff);

    return 0;
}

Example Use/Output

Now you can measure the relative number of additions per-second performed based on the number of cores utilized -- and have it return a Total time that is roughly what wall-time would be. For example, measuring the additions per-second using a single core results in:

$ ./bin/pthread_one_per_core 1

running sum with 1 threads.

  thread id  = 140380000397056
  iterations = 1000000000

----------------
Total time: 2.149662 second
Total addtions: 1000000000
Additions per-second: 465189411.172547

Approximatey 465M additions per-sec. Using two cores should double that rate:

$ ./bin/pthread_one_per_core 2

running sum with 2 threads.

  thread id  = 140437156796160
  iterations = 1000000000

  thread id  = 140437165188864
  iterations = 1000000000

----------------
Total time: 2.152436 second
Total addtions: 2000000000
Additions per-second: 929179560.000957

Exactly twice the additions per-sec at 929M/s. Using 4-cores:

$ ./bin/pthread_one_per_core 4

running sum with 4 threads.

  thread id  = 139867841853184
  iterations = 1000000000

  thread id  = 139867858638592
  iterations = 1000000000

  thread id  = 139867867031296
  iterations = 1000000000

  thread id  = 139867850245888
  iterations = 1000000000

----------------
Total time: 2.202021 second
Total addtions: 4000000000
Additions per-second: 1816513309.422720

Doubled again to 1.81G/s, and using 8-cores gives the expected results:

$ ./bin/pthread_one_per_core

running sum with 8 threads.

  thread id  = 140617712838400
  iterations = 1000000000

  thread id  = 140617654089472
  iterations = 1000000000

  thread id  = 140617687660288
  iterations = 1000000000

  thread id  = 140617704445696
  iterations = 1000000000

  thread id  = 140617662482176
  iterations = 1000000000

  thread id  = 140617696052992
  iterations = 1000000000

  thread id  = 140617670874880
  iterations = 1000000000

  thread id  = 140617679267584
  iterations = 1000000000

----------------
Total time: 2.250243 second
Total addtions: 8000000000
Additions per-second: 3555171004.558562

3.55G/s. Look over the both answers (currently) and let us know if you have any questions.

note: there are a number of additional clean-ups and validations that could be applied, but for purposes of your example, updating the types to rational unsigned prevents strange results with thread_id and the addition numbers.

Upvotes: 1

Jasen
Jasen

Reputation: 12432

clock() measures CPU time not "Wall" time. it also measures the total time of all threads..

CPU time is time when the processor was executing you code, wall time is real world elapsed time (like a clock on the wall would show)

time your program using /usr/bin/time to see what's really happening. or use a wall-time function like time(), gettimeofday() or clock_gettime()

clock_gettime() can measure CPU time for this thread, for this process, or wall time. - it's probably the best way to do this type of experiment.

Upvotes: 6

Related Questions