Simon Naude

Reputation: 43

OpenCL Kernel Troubles

Hi, I created two kernels for a simple matching deshredder program to be run with OpenCL and timed. The two kernels do what they are supposed to do, but one runs far slower than the other for a reason I cannot decipher :/ The only real difference is how I store the data being sent up and how the matching happens.

__kernel void Horizontal_Match_Orig(
    __global int* allShreds,
    __global int* matchOut,
    const unsigned int shredCount,
    const unsigned int pixelCount)
{
    int match = 0;
    int GlobalID = get_global_id(0);
    int currShred = GlobalID / pixelCount;               // which shred this work item belongs to
    int thisPixel = GlobalID - (currShred * pixelCount); // pixel index within that shred
    int matchPixel = allShreds[GlobalID];                // == allShreds[(currShred * pixelCount) + thisPixel]
    for (int i = 0; i < shredCount; i++)
    {
        match = 0;
        // compare against the corresponding pixel of shred i
        if (matchPixel == allShreds[(i * pixelCount) + thisPixel])
        {
            if (matchPixel == 0)
            {
                match = match + 150;
            }
            else match = match + 1;
        }
        else match = match - 50;
        atomic_add(&matchOut[(currShred * shredCount) + i], match);
    }
}

This kernel stores the shred edges horizontally: the pixels of the first shred occupy positions 0 to n-1 of allShreds, the pixels of the next shred occupy positions n to 2n-1, and so on (where n = the number of pixels per shred edge). Each GPU work item gets one pixel and matches it against the corresponding pixel of every shred (including its own).
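
To make the layout concrete, here is the index of pixel p on shred s as a tiny helper (the function name is mine, not from the program):

int horizontalIndex(int s, int p, int pixelCount)
{
    return (s * pixelCount) + p; // shred-major: consecutive pixels of one shred are adjacent in memory
}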

__kernel void Vertical(
    __global int* allShreds,
    __global int* matchOut,
    const int numShreds,
    const int pixelsPerEdge)
{
    int GlobalID = get_global_id(0);
    int myMatch = allShreds[GlobalID];  // == allShreds[(thisRow * numShreds) + myShred]
    int myShred = GlobalID % numShreds; // which shred this work item belongs to
    int thisRow = GlobalID / numShreds; // pixel row within the edge
    for (int matchShred = 0; matchShred < numShreds; matchShred++)
    {
        int match = 0;
        int matchPixel = allShreds[(thisRow * numShreds) + matchShred];
        if (myMatch == matchPixel)
        {
            if (myMatch == 0)
                match = 150;
            else
                match = 1;
        }
        else match = -50;
        // runs every iteration, mirroring the horizontal kernel
        atomic_add(&matchOut[(myShred * numShreds) + matchShred], match);
    }
}

This kernel stores the shred edges vertically: the first pixels of all the shreds occupy positions 0 to n-1, the second pixels of all the shreds occupy positions n to 2n-1, and so on (where n = the number of shreds). The process is the same as before: each work item gets one pixel and matches it against the corresponding pixel of each of the other shreds.
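
The same kind of sketch for this transposed layout (again, the helper name is mine):

int verticalIndex(int s, int p, int numShreds)
{
    return (p * numShreds) + s; // pixel-major: pixel p of consecutive shreds is adjacent in memory
}

verticalIndex is just horizontalIndex with the array transposed, so walking along one shred now strides through memory instead of scanning it sequentially.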

Both give the same correct results, tested against a purely sequential program. In theory they should run in roughly the same amount of time, with the vertical one possibly running faster since the atomic add shouldn't affect it as much... However, it runs far slower... Any ideas?

This is the code I use to launch the kernel (I am using a C# wrapper):

theContext.EnqueueNDRangeKernel(1, null, new int[] { minRows * shredcount }, null, out clEvent);

with the total global work size equal to the total number of pixels (#Shreds × #Pixels per shred).

Any help would be greatly appreciated

Upvotes: 4

Views: 159

Answers (1)

user703016

Reputation: 37945

The two kernels do what they are supposed to do, but one runs far slower than the other for a reason I cannot decipher :/ The only real difference is how I store the data being sent up and how the matching happens.

And that makes all the difference. This is a classic memory coalescing problem. You haven't specified your GPU model or vendor in your question, so I'll have to stay vague, as the actual numbers and behavior are completely hardware dependent, but the general idea is reasonably portable.

Work items in a GPU issue memory requests (reads and writes) together, per "warp" / "wavefront" / "sub-group", to the memory engine. The engine serves memory in transactions (power-of-two-sized chunks of 16 to 128 bytes). Let's assume a transaction size of 128 bytes for the following example.

Enter memory access coalescing: if the 32 work items of a warp read 4 bytes each (an int or float) from consecutive addresses, the memory engine issues a single transaction to serve all 32 requests. But whenever one read is more than 128 bytes away from another, it needs its own transaction. In the worst case that's 32 transactions of 128 bytes each, which is far more expensive.
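
A minimal pair of kernels makes the contrast concrete (a sketch; the kernel names and the stride parameter are mine, not taken from the question):

__kernel void coalesced_read(__global const int* data, __global int* out)
{
    int gid = get_global_id(0);
    out[gid] = data[gid]; // 32 lanes read 32 consecutive ints (128 bytes): one transaction
}

__kernel void strided_read(__global const int* data, __global int* out, const int stride)
{
    int gid = get_global_id(0);
    out[gid] = data[gid * stride]; // lanes read 4 * stride bytes apart: up to 32 transactions
}

With stride = 1 the warp's 32 four-byte reads fit in a single 128-byte segment; once the stride exceeds 32 ints, every read lands in its own segment.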


Your horizontal kernel does the following access:

allShreds[(i * pixelCount) + thisPixel]

The (i * pixelCount) term is constant across work items; only thisPixel varies. Given your code, work item 0 has thisPixel = 0, work item 1 has thisPixel = 1, and so on, which means adjacent work items request adjacent reads, so you get a perfectly coalesced access. The same applies to the call to atomic_add.

On the other hand, your vertical kernel does the following accesses:

allShreds[(thisRow * numShreds) + matchShred]
// ...
matchOut[(myShred * numShreds) + matchShred]

matchShred and numShreds are constant across work items; only thisRow and myShred vary. This means you are requesting reads (and atomic adds into matchOut) that are numShreds elements apart. This is not sequential access and is therefore not coalesced.
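
A small worked example with the hypothetical numbers from above (warp size 32, 4-byte ints, 128-byte transactions) shows how the stride drives the transaction count:

#include <stdio.h>

/* Count the 128-byte transactions a 32-wide warp needs when lane i
 * accesses element i * strideElems of an int array. */
int transactions(int warpSize, int elemBytes, int strideElems)
{
    int count = 0;
    long lastSegment = -1;
    for (int lane = 0; lane < warpSize; lane++)
    {
        long segment = ((long)lane * strideElems * elemBytes) / 128;
        if (segment != lastSegment) // lanes access ascending addresses
        {
            count++;
            lastSegment = segment;
        }
    }
    return count;
}

int main(void)
{
    printf("stride 1 (horizontal kernel): %d transaction(s)\n", transactions(32, 4, 1));
    printf("stride 64 (vertical kernel, numShreds = 64): %d transaction(s)\n", transactions(32, 4, 64));
    return 0;
}

This prints 1 transaction for the unit-stride pattern and 32 for a stride of numShreds = 64 elements: the same 128 useful bytes per warp, but fetched in 32 transactions instead of one.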

Upvotes: 2
