How can multithreading give a factor of speedup that is larger than the number of cores?

Question

I am using pthreads with gcc. The simple code example takes the number of threads "N" as a user-supplied input. It splits up a long array into N roughly equally sized subblocks. Each subblock is written by its own thread.

The dummy processing for this example really involves sleeping for a fixed amount of time for each array index and then writing a number into that array location.

Here's the code:

/******************************************************************************
* FILE: threaded_subblocks_processing
* DESCRIPTION:
* We have a bunch of parallel processing to do and store the results in a
* large array. Let's try to use threads to speed it up.
******************************************************************************/
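#include <stdio.h>    /* printf */
#include <stdlib.h>   /* malloc, atoi */
#include <unistd.h>   /* usleep */
#include <pthread.h>  /* pthread_exit and the other pthread calls */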

#define BIG_ARR_LEN 10000

typedef struct thread_data{
  int start_idx;
  int end_idx;
  int id;
} thread_data_t;

int big_result_array[BIG_ARR_LEN] = {0};

void* process_sub_block(void *td)
{
   struct thread_data *current_thread_data = (struct thread_data*)td;
   printf("[%d] Hello World! It's me, thread #%d!\n", current_thread_data->id, current_thread_data->id);
   printf("[%d] I'm supposed to work on indexes %d through %d.\n", current_thread_data->id, 
       current_thread_data->start_idx, 
       current_thread_data->end_idx-1);

   for(int i=current_thread_data->start_idx; i<current_thread_data->end_idx; i++)
   {
       int retval = usleep(1000.0*1000.0*10.0/BIG_ARR_LEN);
       if(retval)
       {
         printf("sleep failed");
       }

       big_result_array[i] = i;
   }

   printf("[%d] Thread #%d done, over and out!\n", current_thread_data->id, current_thread_data->id);
   pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
   if (argc!=2)
   {
     printf("usage: ./a.out number_of_threads\n");
     return(1);
   }

   int NUM_THREADS = atoi(argv[1]);

   if (NUM_THREADS<1)
   {
     printf("usage: ./a.out number_of_threads (where number_of_threads is at least 1)\n");
     return(1);
   }

   pthread_t *threads = malloc(sizeof(pthread_t)*NUM_THREADS);
   thread_data_t *thread_data_array = malloc(sizeof(thread_data_t)*NUM_THREADS);

   int block_size = BIG_ARR_LEN/NUM_THREADS;
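   /* NOTE: the listing as posted breaks off after "for(int i=0; i". The rest of
      main() below is a reconstruction based on the description in the question. */
   for(int i=0; i<NUM_THREADS; i++)
   {
      thread_data_array[i].id = i;
      thread_data_array[i].start_idx = i*block_size;
      /* the last thread also takes any leftover elements */
      thread_data_array[i].end_idx = (i==NUM_THREADS-1) ? BIG_ARR_LEN : (i+1)*block_size;

      if (pthread_create(&threads[i], NULL, process_sub_block, &thread_data_array[i]))
      {
        printf("pthread_create failed\n");
        return(1);
      }
   }

   /* wait for every thread to finish before exiting */
   for(int i=0; i<NUM_THREADS; i++)
   {
      pthread_join(threads[i], NULL);
   }

   free(threads);
   free(thread_data_array);
   return(0);
}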



and I compile it with

gcc -pthread threaded_subblock_processing.c


and then I call it from command line like so:

$ time ./a.out 4


I see a speed up when I increase the number of threads. With 1 thread the process takes just a little over 10 seconds. This makes sense because I sleep for 1000 usec per array element, and there are 10,000 array elements. Next when I go to 2 threads, it goes down to a little over 5 seconds, and so on.

What I don't understand is that I get a speed-up even after my number of threads exceeds the number of cores on my computer! I have 4 cores, so I was expecting no speed-up for >4 threads. But, surprisingly, when I run

$ time ./a.out 100


I get a 100x speedup and the processing completes in ~0.1 seconds! How is this possible?

Richard · Accepted Answer

Some general background

A program's progress can be slowed by many things, but, in general, you can divide slow spots (otherwise known as hot spots) into two categories:

CPU bound: In this case, the processor is doing some heavy number crunching (like trigonometric functions). If all the CPU's cores are engaged in such tasks, other processes must wait.
Memory bound: In this case, the processor is waiting for information to be retrieved from the hard disk or RAM. Since these are typically orders of magnitude slower than the processor, from the CPU's perspective this takes forever.

But you can also imagine other situations in which a process must wait, such as for a network response.

In many of these memory-/network-bound situations, it is possible to put a thread "on hold" while the data crawls towards the CPU and do other useful work in the meantime. If this is done well, a multi-threaded program can comfortably outperform its single-threaded equivalent. Node.js makes use of such asynchronous programming techniques to achieve good performance.

[Figure: a depiction of various latencies (image not preserved)]

Your question

Now, getting back to your question: you have multiple threads going, but they are performing neither CPU-intensive nor memory-intensive work, so there is little to take up processor time. In fact, the sleep function is essentially telling the operating system that no work is being done, so the OS can run other threads while yours sleep. A sleeping thread does not occupy a core, which means the number of cores does not limit how many sleeps can overlap. With 100 threads, each thread handles 100 elements at roughly 1 ms apiece, and all 100 threads sleep at the same time, which gives the ~0.1 seconds you measured.

Note that low-latency applications, such as those built on MPI, sometimes use busy waiting instead of a sleep function. In this case, the program goes into a tight loop and repeatedly checks a condition. Externally, the effect looks similar, but sleep uses essentially no CPU while the busy wait uses ~100% of a core.
