Reputation: 93
I have the following code setup:
#define TSIZE 32
#define TNUM 24000000
#define CORES 4
/* Byte-wise swap two items of size SIZE. */
#define SWAP(a, b, size)              \
  do                                  \
    {                                 \
      size_t __size = (size);         \
      char *__a = (a), *__b = (b);    \
      do                              \
        {                             \
          char __tmp = *__a;          \
          *__a++ = *__b;              \
          *__b++ = __tmp;             \
        } while (--__size > 0);       \
    } while (0)
char* TWEETS;
size_t partition(void* arr, size_t left, size_t right, int (*compar)(const void*, const void*))
{
    char* cArr = (char*) arr;
    size_t i;
    size_t pivotIndex = (left + right) / 2;
    char* pivotValue = &cArr[(size_t)TSIZE * right];
    size_t index = left;
    SWAP(&cArr[(size_t)TSIZE * pivotIndex], &cArr[(size_t)TSIZE * right], (size_t)TSIZE);
    for(i = left; i < right; i++) {
        if(compar((void*) &cArr[(size_t)TSIZE * i], (void*) pivotValue) < 0) {
            SWAP(&cArr[(size_t)TSIZE * i], &cArr[(size_t)TSIZE * index], (size_t)TSIZE);
            index++;
        }
    }
    SWAP(&cArr[(size_t)TSIZE * index], &cArr[(size_t)TSIZE * right], (size_t)TSIZE);
    return index;
}
void quicksort(void* base, size_t left, size_t right, int (*compar)(const void*, const void*))
{
    if(left < right) {
        size_t pivot = partition(base, left, right, compar);
        #pragma omp task
        quicksort(base, left, pivot-1, compar);
        #pragma omp task
        quicksort(base, pivot+1, right, compar);
    }
}
int main(int argc, char** argv) {
    omp_set_dynamic(0);
    omp_set_num_threads(CORES);
    TWEETS = (char*) malloc((size_t)TNUM * (size_t)TSIZE * (size_t)CORES * (size_t)sizeof(char));
    if(TWEETS == NULL) exit(1);
    readData();
    #pragma omp parallel
    {
        #pragma omp single
        quicksort(TWEETS, 0, ((size_t)CORES*(size_t)TNUM)-(size_t)1, compare);
    }
    free(TWEETS);
}
So, first of all, forgive the huge number of (size_t) casts; I added them in a fit of desperation.
What am I doing here
I am reading in a text file with 24 million lines of text, each line contains 32 bytes of characters. The lines are then sorted according to the compare function, which I have omitted here. I can guarantee that this function works and is not the cause of my trouble. At all times it returns either -1, 0 or 1.
I am also trying to parallelize the quicksort algorithm. The number of lines grows with the number of cores I use, e.g. 1 core = 24 million, 2 cores = 48 million, and so on.
What is working already
Sorting the file works with 1 to 8 cores, as long as the file stays at or below 48 million lines of text.
What is my problem
My problem is that once I try to sort a file with 72 million lines of text or more, the quicksort algorithm runs into a segmentation fault. I have debugged the code with gdb as far as I could, and the line at fault is this:
SWAP(&cArr[(size_t)TSIZE * i], &cArr[(size_t)TSIZE * index], (size_t)TSIZE);
It's the SWAP call inside the for loop of the partition function. I could also see that, at this point, the variable "right" had the value 18446744073709551615 (2^64-1), which is the cause of the segmentation fault. The maximum value "right" should ever have is (TNUM * CORES) - 1. Since the number is that huge, my only guess is that an overflow happens somewhere in the algorithm.
And, well, here's the catch: the algorithm and the entire program work flawlessly when staying at or below 48 million lines of text. Once I go beyond that, the segfault happens. I have also made sure that reading in the data works; after the data is read in, about 3 GB of my RAM are in use. The segfault definitely occurs during the sorting of the char array.
So why is it working with up to 48 million lines of text, and why is it segfaulting with more than that? Where is my mistake?
Upvotes: 3
Views: 363
Reputation:
The one-pass partition operation is simple and cute, but it does much more swapping than it needs to. As written, in a little test I found it does two-and-a-half times as many swaps as required!
There are two issues with the 'cute' partition:

1. when `index` and `i` are the same, and the value is less than the pivot value, it swaps an item with itself. If the swap is expensive, then a test for "self" will save a little.

2. when it swaps, it moves the value at `index` forwards to `i`, and steps both. If `index` arrives at that value again, it may have to swap it forwards again -- doing "extra" (unnecessary) swaps to shuffle values along ahead of the advancing `index`.
Consider partitioning five values: `9 3 5 1 4`. First, 5 will be chosen as the pivot-value and swapped with the 4, giving `9 3 4 1 5`. Then, starting with `index == i == left`, the 9 is greater than the pivot, so leave `index` and advance `i`. Now 3 is less than the pivot, so we swap 9 and 3 and advance both `index` and `i`, giving `3 9 4 1 5`. Now 4 is less than the pivot, so swap again, shuffling the 9 forwards, giving `3 4 9 1 5`. And 1 is also less than the pivot, so swap again, giving `3 4 1 9 5`. Finally, swapping the pivot value into place completes the process, giving `3 4 1 5 9`.
So, this does 3 swaps to shuffle the 9 along, where only 1 swap is required.
A common, but less simple, way of doing the partition scans first from the left and then from the right, looking for values that need to be swapped up and down, so that the process is done in the minimum number of swaps.
I tried this for vectors of 50 integers, each value selected at random from 1..50. I ran the 'cute' partition and a more 'Traditional' one, and counted the swaps over 20,000 trials. I got:
Average 'trad' swaps = 9.8
Average 'cute' swaps = 23.0
Average 'cute' selfs = 2.9
Average 'cute' extra = 13.0
where 'trad' minimises the number of swaps. The "selfs" are swaps where `index == i`, and the "extras" are where a value is swapped forward more than once. The 'cute' swaps count includes the 'extras', but not the 'selfs' (because the 'selfs' are trivial to eliminate).
The swap counts include swapping out the pivot value and swapping it back in again -- because those are the same for both algorithms. Discounting those two swaps, the 'cute' algorithm is doing 21.0/7.8, or roughly two-and-a-half times as many swaps as are required (on average, in my little test).
So, however cute it is, the simple, one pass partition is rubbish.
For completeness, here is a more "traditional" partition:
/* 'left' and 'right' are indexes into 'data', and there
 * are 'right' - 'left' + 1 items in the partition.
 *
 * 'pivot' is the index of the chosen pivot-value, and is set
 * to the (new) index for that when the partition completes.
 */
pv = data[pivot] ;
swap(data, pivot, right) ;

l = left ;
r = right ;
while (l < r)
  {
    --r ;
    while ((l < r) && (data[l] <= pv))
      ++l ;
    while ((l < r) && (data[r] >= pv))
      --r ;

    if (l == r)
      {
        if (data[r] < pv)
          ++r ;
        break ;
      } ;

    swap(data, l, r) ;
    ++l ;
  } ;

pivot = r ;
swap(data, pivot, right) ;
Upvotes: 0
Reputation: 66194
You have an edge case in your algorithm that isn't accounted for.
If the bottom (left edge) partition of the original sequence ever encounters no swaps (i.e. every value is "greater-or-equal" than the pivot), then `index`, which started out at zero (0), will remain as such. The index `i` will march to the end. The pivot value that you're temporarily storing in the right-most slot is then swapped into place (i.e. `cArr[0]` and `cArr[right]` are swapped), and you return 0 from the function. In other words, this:
size_t pivot = partition(base, left, right, compar);
#pragma omp task
quicksort(base, left, pivot-1, compar);
// here ================^
executes with `pivot` being returned as zero from the prior call, passes `pivot-1` as `right`, and results in an underflow. This would give you exactly the value of `right` you're getting when you fault (2^64-1 on every 64-bit platform I've ever used).
You need to account for this (or never let it happen in the first place). Whether this happens in your code is entirely dependent on the content of each partition that is processed with `left == 0`. It may not happen the first time, the second time, etc., but get the right data swapped into that continually decreasing partition space and eventually it can happen.
Untested, But Worth A Look
I'm not a fan of `left` and `right` partitioning markers in C implementations of quicksort() in the first place. The language supports pointer math, so use that and carry around something you know is concrete (a base and a length). I've not tested the following, and have only once ever had to deal with OMP, but simplified, what I mean is something like this:
void quicksort(void* base, size_t len, int (*compar)(const void*, const void*))
{
    if (len < 2)
        return;

    char* cArr = (char*)base;
    char* pivotValue = cArr + ((size_t)TSIZE * (len - 1));
    SWAP(cArr + ((size_t)TSIZE * (len / 2)), pivotValue, (size_t)TSIZE);

    size_t i;
    size_t pivot = 0;
    for (i = 0; i < len - 1; ++i)
    {
        if (compar(cArr + ((size_t)TSIZE * i), pivotValue) < 0)
        {
            SWAP(cArr + ((size_t)TSIZE * i), cArr + ((size_t)TSIZE * pivot), (size_t)TSIZE);
            ++pivot;
        }
    }
    SWAP(cArr + ((size_t)TSIZE * pivot), pivotValue, (size_t)TSIZE);

    #pragma omp task
    quicksort(cArr, pivot, compar);
    #pragma omp task
    quicksort(cArr + ((size_t)TSIZE * (pivot + 1)), len - pivot - 1, compar);
}
I hope it is obvious how this is called.
Upvotes: 2