Reputation: 11916
There are many 4 element, 6-element(input) ... 16-input sorting networks but I need a 32 input version to have a 32x32 shear-sort algorithm(which I plan as an Opencl auxiliary function) to have a 1024x1024 shear-sort opencl algorithm. How can I derive my 32-input sorting network?
or just found by trial and error?
Input array: 1M elements ----> 1024x1024 2D matrix with inverted odd-rows (shear)
each row(1024) of matrix --------> 32 x 32 2D matrix (shear)
32 element row ---------> Sorting (network)
Each thread computes one row of 1024 elements. So only 1024 threads for 1M element array.
The non-divergent comparison I plan to use in the network is:
if(a>b) // where a and b are between 0 and 16M
swap(a,b)
becomes
a0=a; b0=b; // saving
c = a-b
d = !(sign bit of c) (0 for negative, 1 for positive)
tmp=b*d; //tmp=a if a>b otherwise 0
a=a*d //a=b if a>b otherwise 0
b=tmp*d; //b=tmp if a>b otherwise 0
// a0 is backup of a, b0 is backup of b
e = (sign bit of c) (1 for negative, 0 for positive)
tmp0=a0*e; //tmp0=a0 if a0<=b0 otherwise 0
a0=b0*e //a0=b0 if a0<=b0 otherwise 0
b0=tmp0*e; //b0=tmp0 if a0<=b0 otherwise 0
aOut=a+a0; // only a or a0 can be different than zero
bOut=b+b0; // only b or b0 can be different than zero
Im sure this is not fastest one but I need to make a quick easy sorting to try on my particle constraint solver which screams sorting for fixed spatial indexing (grid), I have 1M particles and trying a shear of shear of network sorting.
To validate the shear sorting, I implemented 32-input sorting serial bitonic sorter per thread basis to build 32x32 matrix each column and row sorted. So 32x32 = 1024 element sorting took 9 ms and this is too slow for 32 cores @ 700 MHz.
1024 element sorting takes 9 ms and at least 20 iterations will be needed after each 1024 sorting to sort 1M array. Even if it gets 90 ms, this is too slow for just keys. There will be many values bound to keys.(more than 100)
Tried bubblesort in place of bitonic and got 10ms so the problem must be in the shear sort implementation may be?
Upvotes: 3
Views: 1212
Reputation: 22572
Currently, the sorting network known to use the least compare-exchange units to sort 32 elements works as follows: sort the first 16 elements with the most efficient sorting network of size 16, do the same for the following 16 elements, then use the merge step from Batcher's odd-even mergesort. Basically, it gives the following pairs of compare-exchange units:
Sort the first half of the array:
[0,1],[2,3],[4,5],[6,7],[8,9],[10,11],[12,13],[14,15],
[0,2],[4,6],[8,10],[12,14],[1,3],[5,7],[9,11],[13,15],
[0,4],[8,12],[1,5],[9,13],[2,6],[10,14],[3,7],[11,15],
[0,8],[1,9],[2,10],[3,11],[4,12],[5,13],[6,14],[7,15],
[5,10],[6,9],[3,12],[13,14],[7,11],[1,2],[4,8],
[1,4],[7,13],[2,8],[11,14],[5,6],[9,10],
[2,4],[11,13],[3,8],[7,12],
[6,8],[10,12],[3,5],[7,9],
[3,4],[5,6],[7,8],[9,10],[11,12],
[6,7],[8,9]
Sort the second half of the array:
[16,17],[18,19],[20,21],[22,23],[24,25],[26,27],[28,29],[30,31],
[16,18],[20,22],[24,26],[28,30],[17,19],[21,23],[25,27],[29,31],
[16,20],[24,28],[17,21],[25,29],[18,22],[26,30],[19,23],[27,31],
[16,24],[17,25],[18,26],[19,27],[20,28],[21,29],[22,30],[23,31],
[21,26],[22,25],[19,28],[29,30],[23,27],[17,18],[20,24],
[17,20],[23,29],[18,24],[27,30],[21,22],[25,26],
[18,20],[27,29],[19,24],[23,28],
[22,24],[26,28],[19,21],[23,25],
[19,20],[21,22],[23,24],[25,26],[27,28],
[22,23],[24,25]
Odd-even merge the two halves of the array:
[0, 16],
[8, 24],
[8, 16],
[4, 20],
[12, 28],
[12, 20],
[4, 8],
[12, 16],
[20, 24],
[2, 18],
[10, 26],
[10, 18],
[6, 22],
[14, 30],
[14, 22],
[6, 10],
[14, 18],
[22, 26],
[2, 4],
[6, 8],
[10, 12],
[14, 16],
[18, 20],
[22, 24],
[26, 28],
[1, 17],
[9, 25],
[9, 17],
[5, 21],
[13, 29],
[13, 21],
[5, 9],
[13, 17],
[21, 25],
[3, 19],
[11, 27],
[11, 19],
[7, 23],
[15, 31],
[15, 23],
[7, 11],
[15, 19],
[23, 27],
[3, 5],
[7, 9],
[11, 13],
[15, 17],
[19, 21],
[23, 25],
[27, 29],
[1, 2],
[3, 4],
[5, 6],
[7, 8],
[9, 10],
[11, 12],
[13, 14],
[15, 16],
[17, 18],
[19, 20],
[21, 22],
[23, 24],
[25, 26],
[27, 28],
[29, 30]
I generated the previous indices pairs with the oddeven_merge
algorithm as given on the Wikipedia page. I can't guarantee that it will be faster than what you already have, but it will at least lower the number of compare-exchange units from 191 (with Batcher's odd-even mergesort) to 185. I have read research papers on the matter and it seems that we currently don't know sorting networks with less than 185 comparators to sort 32 elements.
Upvotes: 2
Reputation: 11916
I found some valuable info which is called Hasse Diagrams. Reading it, but solving will take some time(maybe i will never be able to) so I searched and found below already-completed solution:
Network for N=32, using Batcher's Merge-Exchange. (192 comparators) http://jgamble.ripco.net/cgi-bin/nw.cgi?inputs=32&algorithm=batcher&output=svg
Using this to sort 32 arrays each 32 elements (no for loop, just hand written comparators), and applying a shear sort on 32 cores, kernel time decreased from 9ms to 2ms compared to bitonic sort(for loop) + shear sort(for loops and parallel).
This is 4x speedup from going for-looped bitonic sorter to non-loop Batcher's Merge Exchange.
From this:
void MergeSort32(int * RegisterArray, int dir)"
{"
int n=32;"
for (int s=2; s <= n; s*=2)
{
for (int i=0; i < n;)
{
merge_up((RegisterArray+i),s,dir);
merge_down((RegisterArray+i+s),s,dir);
i += s*2;
}
}
}
to
swapx(arr,0,16,dir);
swapx(arr,1,17,dir);
swapx(arr,2,18,dir);
swapx(arr,3,19,dir);
swapx(arr,4,20,dir);
swapx(arr,5,21,dir);
swapx(arr,6,22,dir);
swapx(arr,7,23,dir);
...
... 191 lines of branchless comparators-swappers
(branching version crashes for some reason, maybe because of
inlined 382 if sentences per core
xor swap idiom shouldnt be a problem.)
But opencl compiling time increased from 0.5 seconds to 7.5 seconds. Adding -cl-opt-disable to options makes compiling forever until I ctrl+alt+del. So auto optimization options must be optimizing the compiling part too.
Edit: shear sort (1M) built by shear sorts(1k each) by Batcher's (32 elements each) : 0.341 seconds. HD7870, using 64 threads per work group. Sort is working as validated but much slower than a single core cpu shell sort (0.050 seconds).
Upvotes: 0