Reputation: 1377
I have some C code for merge two sorted arrays:
void merge(int m, int n, int A[], int B[], int C[]) {
int i, j, k;
i = 0;
j = 0;
k = 0;
while (i < m && j < n) {
if (A[i] <= B[j]) {
C[k] = A[i];
i++;
} else {
C[k] = B[j];
j++;
}
k++;
}
if (i < m) {
for (int p = i; p < m; p++) {
C[k] = A[p];
k++;
}
} else {
for (int p = j; p < n; p++) {
C[k] = B[p];
k++;
}
}
}
I want to put merge part to OpenCL kernel, what the best way to do this? Or what the best way to merge two sorted arrays with OpenCL?
Upvotes: 2
Views: 1696
Reputation: 50
The easiest way would be to create three buffers A, B and C, and then call two clEnqueueCopyBuffer() as follows:
clEnqueueCopyBuffer( cmdQueue, A, C, 0, 0, m, 0, NULL, NULL );
clEnqueueCopyBuffer( cmdQueue, B, C, 0, m, n, 0, NULL, NULL );
If you want a naive kernel which does it, the following would work:
__kernel void merge(int m, __global const int* A, __global const int* B, _global int* C )
{
int id= (int)get_global_id(0);
if( id<m )
{
C[id]=A[id];
}
else
{
C[id]=B[id-m];
}
}
This kernel is in no way optimized. There are many ways to optimize depending on the device.
Upvotes: 0
Reputation: 5087
If your arrays' lengths are the same power of two, you can us bitonic sorting. Just start from the final butterfly step (last block of the blue/brown diagram in the wiki link), and you will saturate the gpu while getting get the most out of the memory speed of the device. You can also pad your arrays if they are close to a power of two. I have successfully sorted lists of several million (eg 2^20 .. 2^24) entries with this method. See: 'Bitonic Sorter' Wiki
If you have an arbitrary number of elements in each array, it may not be worth the transfer time when you are dealing with two lists that are already sorted. This is because you would be comparing only two values at a time, and moving one of them to the result list. This is a terrible use of the gpu, because you are basically single-threaded. An optimization might be to load the first 4-8kb from each of your source arrays into local memory, then write the sorted block to local memory as well. You would still only use one compute unit of your entire gpu, but the memory speeds would be great. Again, probably not worth the trouble. Your cpu L1 and L2 data cache and superior clock speed should outperform the gpu when merging arbitrary length, sorted arrays.
Upvotes: 3