Reputation: 118
I am currently working on a recursively parallel Quicksort extension method for the List class. The code below uses the most basic thread-distribution criterion I've considered, since it should be the simplest to explain conceptually: it spawns tasks down to a depth equal to the base-2 logarithm of the number of detected processors, and proceeds sequentially from there. Each CPU should therefore get one thread with a (roughly) equal, large share of data to process, avoiding excessive overhead. The basic sequential algorithm is provided for comparison.
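To make the branching criterion concrete, here is a small standalone sketch (core counts picked arbitrarily) of how the detected processor count maps to spawn depth and leaf-level sequential sorts:

```csharp
using System;

class DepthDemo
{
    static void Main()
    {
        // threadDepth = floor(log2(cores)); task spawning stops at that depth,
        // leaving roughly one large sequential sort per core at the leaves.
        foreach (var cores in new[] { 2, 4, 12 })
        {
            int depth = (int)Math.Log(cores, 2);
            Console.WriteLine($"{cores} cores -> threadDepth {depth}, {1 << depth} leaf sorts");
        }
    }
}
```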
public static class Quicksort
{
    /// <summary>
    /// Helper class to hold information about when to parallelize
    /// </summary>
    /// <attribute name="maxThreads">Maximum number of supported threads</attribute>
    /// <attribute name="threadDepth">The depth to which new threads should
    /// automatically be made</attribute>
    private class ThreadInfo
    {
        internal int maxThreads;
        internal int threadDepth;

        public ThreadInfo(int length)
        {
            maxThreads = Environment.ProcessorCount;
            threadDepth = (int)Math.Log(maxThreads, 2);
        }
    }

    /// <summary>
    /// Helper function to perform the partitioning step of quicksort
    /// </summary>
    /// <param name="list">The list to partition</param>
    /// <param name="start">The starting index</param>
    /// <param name="end">The ending index</param>
    /// <returns>The final index of the pivot</returns>
    public static int Partition<T>(this List<T> list, int start, int end) where T : IComparable
    {
        int middle = start + (end - start) / 2; // written this way to avoid int overflow on large indices
        // Swap pivot and first item.
        var temp = list[start];
        list[start] = list[middle];
        list[middle] = temp;
        var pivot = list[start];
        var swapPtr = start;
        for (var cursor = start + 1; cursor <= end; cursor++)
        {
            if (list[cursor].CompareTo(pivot) < 0)
            {
                // Swap cursor element and designated swap element
                temp = list[cursor];
                list[cursor] = list[++swapPtr];
                list[swapPtr] = temp;
            }
        }
        // Swap pivot with final lower item
        temp = list[start];
        list[start] = list[swapPtr];
        list[swapPtr] = temp;
        return swapPtr;
    }

    /// <summary>
    /// Method to begin parallel quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortParallel<T>(this List<T> list) where T : IComparable
    {
        if (list.Count < 2048)
            list.QuicksortSequential();
        else
        {
            var info = new ThreadInfo(list.Count);
            list.QuicksortRecurseP(0, list.Count - 1, 0, info);
        }
    }

    /// <summary>
    /// Method to implement parallel quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    /// <param name="depth">The current recursive depth</param>
    /// <param name="info">Structure holding decision-making info for threads</param>
    private static void QuicksortRecurseP<T>(this List<T> list, int start, int end, int depth,
                                             ThreadInfo info)
        where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);
        if (depth < info.threadDepth)
        {
            var t = Task.Run(() =>
            {
                list.QuicksortRecurseP(start, middle - 1, depth + 1, info);
            });
            list.QuicksortRecurseP(middle + 1, end, depth + 1, info);
            t.Wait();
        }
        else
        {
            list.QuicksortRecurseS(start, middle - 1);
            list.QuicksortRecurseS(middle + 1, end);
        }
    }

    /// <summary>
    /// Method to begin sequential quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortSequential<T>(this List<T> list) where T : IComparable
    {
        list.QuicksortRecurseS(0, list.Count - 1);
    }

    /// <summary>
    /// Method to implement sequential quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    private static void QuicksortRecurseS<T>(this List<T> list, int start, int end) where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);
        // Now recursively sort the (approximate) halves.
        list.QuicksortRecurseS(start, middle - 1);
        list.QuicksortRecurseS(middle + 1, end);
    }
}
As far as I understand, this methodology should incur a one-time startup cost and then sort the rest of the data significantly faster than the sequential method. However, the parallel method takes significantly more time than the sequential one, and the gap actually widens as the load increases. Benchmarked on a list of ten million items on a 4-core CPU, the sequential method averages around 18 seconds to run to completion, while the parallel method takes upwards of 26 seconds. Increasing the allowed thread depth rapidly exacerbates the problem.
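For reference, the timings came from a harness along these lines (a sketch, not the exact benchmark; it assumes the Quicksort class above is compiled into the same project, and the element count and seed are arbitrary):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Harness
{
    static void Main()
    {
        var rng = new Random(42);            // fixed seed so both runs see identical data
        var data = new List<int>(10_000_000);
        for (int i = 0; i < 10_000_000; i++)
            data.Add(rng.Next());

        var copy = new List<int>(data);      // fresh copy per run
        var sw = Stopwatch.StartNew();
        copy.QuicksortSequential();
        Console.WriteLine($"Sequential: {sw.Elapsed}");

        copy = new List<int>(data);
        sw.Restart();
        copy.QuicksortParallel();
        Console.WriteLine($"Parallel:   {sw.Elapsed}");
    }
}
```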
Any help in finding the performance hog is appreciated. Thanks!
Upvotes: 3
Views: 220
Reputation: 52240
The problem is CPU cache conflict, also known as "false sharing."
Unless the pivot's memory address happens to fall exactly on a cache-line boundary, one of the threads will take exclusive ownership of the line in the L1 or L2 cache and the other one will have to wait. This can make performance even worse than a serial solution. The problem is described well in this article:
...where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. [1,2] This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
(snip)
In most cases, the parallel code actually ran slower than the sequential code, and in no case did we get any better than a 42% speedup no matter how many cores we threw at the problem.
Grasping at straws... but you may be able to do better if you separate the list into two objects and pin them in memory (or even just pad them within a struct) so they are far enough apart that they won't share a cache line.
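To illustrate the padding idea in isolation, here is a standalone micro-benchmark (not the asker's sort; the class names, iteration count, and the 64-byte line size are my assumptions): two threads each increment their own counter, with the counters either adjacent or padded onto separate cache lines.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Two long counters at adjacent offsets: almost certainly on the same cache line.
class Adjacent
{
    public long A;
    public long B;
}

// Explicit layout pushes B 64 bytes past A, onto a separate line
// (64-byte cache lines are typical on x86/x64, but not guaranteed).
[StructLayout(LayoutKind.Explicit)]
class Padded
{
    [FieldOffset(0)] public long A;
    [FieldOffset(64)] public long B;
}

static class FalseSharingDemo
{
    const long N = 100_000_000;

    static TimeSpan Time(Action a, Action b)
    {
        var sw = Stopwatch.StartNew();
        Task.WaitAll(Task.Run(a), Task.Run(b));
        return sw.Elapsed;
    }

    static void Main()
    {
        var adj = new Adjacent();
        var pad = new Padded();
        // Each thread owns one field outright, so there is no logical sharing;
        // any slowdown in the first run is cache-line contention.
        Console.WriteLine("Adjacent: " + Time(
            () => { for (long i = 0; i < N; i++) adj.A++; },
            () => { for (long i = 0; i < N; i++) adj.B++; }));
        Console.WriteLine("Padded:   " + Time(
            () => { for (long i = 0; i < N; i++) pad.A++; },
            () => { for (long i = 0; i < N; i++) pad.B++; }));
    }
}
```

On a multicore machine the padded version typically runs noticeably faster, which is the effect the quoted article describes.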
Upvotes: 4