Murat Ayfer

Reputation: 3914

Can't get any speedup from parallelizing Quicksort using Pthreads

I'm using Pthreads to create a new thread for each partition after the list is split into the right and left halves (elements less than and greater than the pivot). I do this recursively until I reach the maximum number of allowed threads.

When I use printfs to follow what goes on in the program, I can clearly see that each thread is doing its delegated work in parallel. However, the single-threaded version is always the fastest. As soon as I add more threads, the time it takes to finish almost doubles, and it keeps increasing with the number of threads.

I am allowed to use up to 16 processors on the server I am running it on.

The algorithm goes like this: split the array into left and right halves by comparing each element to the pivot, start a new thread for each half, and wait for those threads to join. If more threads are still allowed, each new thread can create more recursively. Each thread waits for its children to join.
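In simplified form, the structure is roughly this (not my actual code, which is in the pastebin link below; the names and the depth cutoff are just stand-ins for the real maximum-thread check):

#include <pthread.h>
#include <stdlib.h>

#define MAX_DEPTH 4                   /* roughly 2^4 = 16 threads doing leaf work */

struct task { int *a; int lo, hi, depth; };

static int partition(int *a, int lo, int hi) {
    int pivot = a[hi], i = lo;
    for (int k = lo; k < hi; k++)
        if (a[k] < pivot) { int t = a[k]; a[k] = a[i]; a[i] = t; i++; }
    int t = a[hi]; a[hi] = a[i]; a[i] = t;
    return i;
}

static void *parallel_qsort(void *arg) {
    struct task *t = arg;
    if (t->lo >= t->hi) return NULL;

    int p = partition(t->a, t->lo, t->hi);
    struct task left  = { t->a, t->lo,  p - 1, t->depth + 1 };
    struct task right = { t->a, p + 1,  t->hi, t->depth + 1 };

    if (t->depth < MAX_DEPTH) {
        /* one new thread per half, then wait for both to join */
        pthread_t lt, rt;
        pthread_create(&lt, NULL, parallel_qsort, &left);
        pthread_create(&rt, NULL, parallel_qsort, &right);
        pthread_join(lt, NULL);
        pthread_join(rt, NULL);
    } else {
        parallel_qsort(&left);        /* thread limit reached: recurse inline */
        parallel_qsort(&right);
    }
    return NULL;
}

int main(void) {
    enum { N = 1 << 20 };
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++) a[i] = rand();
    struct task root = { a, 0, N - 1, 0 };
    parallel_qsort(&root);
    free(a);
    return 0;
}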

Everything makes sense to me, and the sorting works perfectly well, but adding threads slows it down immensely.

I tried setting a minimum number of elements per partition for a thread to be started (e.g. 50000).

I also tried an approach where a finished thread frees up a slot so that another thread can be started, which led to hundreds of threads starting and finishing over the course of the run. I think the overhead was way too much, so I got rid of that: once a thread finished, no new thread was created in its place. That gave me a little more speed, but it was still a lot slower than the single-threaded version.

The code I used is below.

http://pastebin.com/UaGsjcq2

Does anybody have any clue as to what I could be doing wrong?

Upvotes: 2

Views: 2316

Answers (4)

Clifford

Reputation: 93476

Unless each thread runs on a separate processor or core, the threads will not truly run concurrently and the context-switch time will be significant. The number of threads should be restricted to the number of available execution units, and even then you have to trust that the OS will distribute them to separate processors/cores, which it may not do if those are also being used by other processes.
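For example, on Linux and most other Unix-like systems you can query the number of online processors at run time and cap the thread count at that value (a small sketch; _SC_NPROCESSORS_ONLN is a widely available extension rather than strict POSIX):

#include <stdio.h>
#include <unistd.h>    /* sysconf */

int main(void) {
    /* number of processors currently online; fall back to 1 if the query fails */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1) ncpu = 1;
    printf("online processors: %ld\n", ncpu);
    /* use ncpu as the upper bound on worker threads */
    return 0;
}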

Also you should use a static thread pool rather than creating and destroying threads dynamically. Creating/destroying a thread includes allocating/releasing a stack from the heap, which is non-deterministic and potentially time-consuming.

Finally, are the 16 processors on the server physical, or are they virtual machines? And are they exclusively allocated to your process?

Upvotes: 0

mathk

Reputation: 8143

I just had a quick look at your code, and I have one remark: why are you using a lock? If I understand correctly, what you are doing is something like this:

quickSort(array)
{
    left, right = partition(array);
    t1 = newThread(quickSort(left));
    t2 = newThread(quickSort(right));
    join(t1);
    join(t2);
}

You shouldn't need a lock. Each recursive call to quicksort only touches its own part of the array, so nothing is shared between the threads.

Upvotes: 1

Michael Dorgan

Reputation: 12515

My first guess would be that creating, destroying, and especially synchronizing your threads is going to eat up any possible gain, depending on just how many elements you are sorting. I'd actually guess that it would take quite a lot of data to make up the overhead, and that it probably won't ever be made up.

Because of the way you have structured your sort, you have one thread waiting for another, which waits for another... so you aren't really getting that much parallelism to begin with. You'd be better off using a more linear sort, perhaps something like a radix sort, that splits the data across the threads more evenly. That still has one thread waiting on others a lot, but at least the threads get to do more work in the meantime. Even so, I don't think threads are going to help much here.

Upvotes: 1

Jerry Coffin

Reputation: 490148

Starting a thread has a fair amount of overhead. You'd probably be better off creating a thread pool with some fixed number of threads, along with a thread-safe queue to queue up jobs for the threads to do. The threads wait for an item in the queue, process that item, then wait for another item. If you want to do things really correctly, this should be a priority queue, with the ordering based on the size of the partition (so you always sort the smallest partitions first, to help keep the queue size from getting excessive).

This at least reduces the overhead of starting the threads quite a bit -- but that still doesn't guarantee you'll get better performance than a single-threaded version. In particular, a quick-sort involves little enough work on the CPU itself that it's probably almost completely bound by the bandwidth to memory. Processing more than one partition at a time may hurt cache locality to the point that you lose speed in any case.
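For what it's worth, here is a minimal sketch of the pool-and-queue idea. It uses a plain LIFO job stack rather than the size-ordered priority queue suggested above, and the worker count, the cutoff, and all the names are purely illustrative:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4          /* fixed pool size; match it to the core count */
#define CUTOFF   1024       /* partitions smaller than this are sorted inline */

typedef struct job { int lo, hi; struct job *next; } job_t;

static int *g_data;
static job_t *g_jobs;                    /* LIFO stack of pending partitions */
static int g_outstanding;                /* jobs queued or currently being processed */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond = PTHREAD_COND_INITIALIZER;

static void push_job(int lo, int hi) {   /* caller must hold g_lock */
    job_t *j = malloc(sizeof *j);
    j->lo = lo; j->hi = hi; j->next = g_jobs;
    g_jobs = j;
    g_outstanding++;
    pthread_cond_signal(&g_cond);
}

static int partition(int *a, int lo, int hi) {
    int pivot = a[hi], i = lo;
    for (int k = lo; k < hi; k++)
        if (a[k] < pivot) { int t = a[k]; a[k] = a[i]; a[i] = t; i++; }
    int t = a[hi]; a[hi] = a[i]; a[i] = t;
    return i;
}

static void seq_qsort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    seq_qsort(a, lo, p - 1);
    seq_qsort(a, p + 1, hi);
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&g_lock);
        while (g_jobs == NULL && g_outstanding > 0)
            pthread_cond_wait(&g_cond, &g_lock);
        if (g_jobs == NULL) {                /* nothing queued, nothing running: done */
            pthread_cond_broadcast(&g_cond); /* make sure the other workers wake up */
            pthread_mutex_unlock(&g_lock);
            return NULL;
        }
        job_t *j = g_jobs;
        g_jobs = j->next;
        pthread_mutex_unlock(&g_lock);

        int lo = j->lo, hi = j->hi;
        free(j);

        if (hi - lo < CUTOFF) {
            seq_qsort(g_data, lo, hi);       /* small partition: finish it inline */
            pthread_mutex_lock(&g_lock);
            g_outstanding--;
        } else {
            int p = partition(g_data, lo, hi);
            pthread_mutex_lock(&g_lock);
            push_job(lo, p - 1);             /* hand both halves back to the pool */
            push_job(p + 1, hi);
            g_outstanding--;                 /* this job itself is now finished */
        }
        if (g_outstanding == 0)
            pthread_cond_broadcast(&g_cond); /* let the idle workers exit */
        pthread_mutex_unlock(&g_lock);
    }
}

int main(void) {
    enum { N = 1 << 20 };
    g_data = malloc(N * sizeof *g_data);
    for (int i = 0; i < N; i++) g_data[i] = rand();

    pthread_mutex_lock(&g_lock);
    push_job(0, N - 1);                      /* seed the queue with the whole array */
    pthread_mutex_unlock(&g_lock);

    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++) pthread_join(tid[i], NULL);

    for (int i = 1; i < N; i++)
        if (g_data[i - 1] > g_data[i]) { puts("not sorted"); return 1; }
    puts("sorted");
    free(g_data);
    return 0;
}

The threads are created once and reused, and the mutex only guards the queue bookkeeping, never the array elements themselves, so the synchronization cost stays small compared to the sorting work.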

Upvotes: 6
