Parallelization of Quicksort in C

Question

For an assignment I got a finished functional sequential version of a quicksort program in C, but now I need to speed it up with parallelization with the use of threads. The modified program below is sorting and are using the threads (using >100% CPU), but it takes longer now than before when executing in sequential.

The way this program is suppose to work, is that when the 8th thread has been terminated, the program should now instead continue to execute the rest of the program sequential.

I know maybe this is not the optional way when optimizing with threads, but that's how this program should work. I guess there has something to do with the pthread_join, but I tried to move it, but that makes no difference, or even worse sometimes.

*Note, I did include the initializing of the array here, because it's not relevant.

#include 
#include 
#include 

#define THREADS 8
#define KILO (1024)
#define MEGA (1024*1024) //1 048 576
#define MAX_ITEMS (64 *MEGA) //67 108 864
#define swap(v, a, b) {unsigned tmp; tmp=v[a]; v[a]=v[b]; v[b]=tmp;}

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int *v; 
int global_thread = 1; //Starts at 1, because the thread from main is the first

struct threadArgs 
{
    int low;
    int high;
};

static int partition(int *v, int low, int high, int pivot_index)
{
    if (pivot_index != low)
        swap(v, low, pivot_index);

    pivot_index = low;
    low++;

    while (low <= high) 
    {
        if (v[low] <= v[pivot_index]) 
            low++;
        else if (v[high] > v[pivot_index])
            high--;
        else
            swap(v, low, high);
    }
    
    if (high != pivot_index)
        swap(v, pivot_index, high);

    return high;
}

void* quick_sort(void* arguments)
{
    struct threadArgs* parent = (struct threadArgs*) arguments;
    pthread_t child;
    
    struct threadArgs* child_thread;
    child_thread =  malloc(sizeof(struct threadArgs));    
    
    int pivot_index;

    int low = parent->low;
    int high = parent->high;

    if (low >= high)
        return NULL;

    pivot_index = (low + high) / 2;
    pivot_index = partition(v, low, high, pivot_index);
    
    if(global_thread < THREADS) //should only create 7 new threads, 1 is already from main
    {
        if(pivot_index < high) //child: right side
        {
            child_thread->low = pivot_index +1;
            child_thread->high = high;
            pthread_mutex_lock(&lock);
            global_thread++;
            pthread_mutex_unlock(&lock);
            
            pthread_create(&child, NULL, quick_sort, (void*) child_thread);
        }
        if(low < pivot_index) //parent: left side,
        {
            parent->low = low;
            parent->high = pivot_index - 1;
            quick_sort((void*) parent);
        }
        pthread_join(child, NULL);
    }
    else
    {
        if (low < pivot_index) //left side sequential
        {
            parent->low = low;
            parent->high = pivot_index - 1;
            quick_sort((void*) parent);
        }
        if (pivot_index < high) //right side sequential
        {
            parent->low = pivot_index +1;
            parent->high = high;
            quick_sort((void*) parent);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    init_array();

    struct threadArgs* args; // main thread
    args =  malloc(sizeof(struct threadArgs));
    args->low = 0;
    args->high = MAX_ITEMS - 1;

    quick_sort((void *)args);
    pthread_mutex_destroy(&lock);
    free(args);
    return 0;
}

John Bollinger · Accepted Answer

Your code has a serious memory leak: every call to quick_sort() allocates memory for a child thread's arguments, even if it does not ultimately start a new child thread, and none of that memory is ever freed. malloc() is relatively expensive, and depending on your implementation, it may acquire a lock, thus creating a bottleneck.

Your code has a data race: quick_sort() reads shared variable global_threads outside the protection of a mutex or other synchronization object, and that variable is also modified by some of your threads. The resulting behavior is therefore undefined.

Your code has a flaw in that the global_thread < THREADS branch of your quick_sort() function only conditionally starts a new thread, but unconditionally tries to join one. If no child thread was started then the join attempt will produce undefined behavior on account of reading the value of variable child while it is indeterminate. That would be unlikely to happen in practice with random data, however.

Suggestions:

move the malloc() call so that you allocate only in the event that you are actually going to create a new thread. Since the amount per allocation is small and the number of allocations will then be small, you can probably get away with continuing to allow that memory to leak, but it would be better to free it after joining the child.
Change strategy for how you decide whether to start a child thread. For example, track the recursion depth, and launch a new child only a shallow depth (two or three levels). You then do not need a shared variable or mutex, just an additional member in struct threadArgs.
Make sure not to try to join a child thread when you have not launched one.

Implementation is left as an exercise, but my own implementation of those fixes yielded significant single-thread performance improvement and exhibited reasonable (but not great) speedup from adding more threads: about 30% speedup with two threads, and a bit more than 50% with four.

Parallelization of Quicksort in C

Answers (1)

Related Questions