Almero

Reputation: 1

OpenCL Kernel performs slower on faster GPU

I am fairly new to OpenCL and have been implementing a DSP algorithm to compare its performance on different GPUs against the standard CPU implementation. Although I have achieved a massive performance gain, I find it strange that I get almost the same gain on a GT240 as on a much faster GTX 480. My program executes two kernels; one speeds up on the GTX 480, while the other slows down.

GT240: Kernel 1: 226us, Kernel 2: 103us.

GTX 480: Kernel 1: 35us, Kernel 2: 293us.

These numbers were obtained using Visual Profiler. Below is the code for Kernel 2, which is almost 3 times slower on the bigger card. This kernel takes a block of memory holding iTotalBins x iNumAngles floats, finds the maximum of each row of iNumAngles values, and fits a curve through the maximum and its two adjacent values.

__kernel void max_curve_fit_gpu (__global float* fCorrelationResult,
                                 const int iNumAngles,
                                 const int iTotalBins,
                                 __global float* fResult)
{
    // Get the thread ID, which is used as the index of the bin the direction is being calculated for
    const int iBinNum = get_global_id(0);
    const int iOffset = iBinNum*iNumAngles;

    // Find the max for this bin
    float fMax = 0;
    int iMaxIndex = 0;
    for (int iAngle=0; iAngle<iNumAngles; iAngle++)
    {
        if (fMax < fCorrelationResult[iOffset + iAngle])
        {
            fMax = fCorrelationResult[iOffset + iAngle];
            iMaxIndex = iAngle;
        }
    }

    // Do the curve fit using the neighbours of the max (indices wrap around the row)
    float fPrev, fNext, fA, fB, fAxis;
    fPrev = fCorrelationResult[iOffset + (iMaxIndex + iNumAngles - 1) % iNumAngles];
    fNext = fCorrelationResult[iOffset + (iMaxIndex + iNumAngles + 1) % iNumAngles];

    fB = (fPrev - fNext)*0.5f;
    fA = (fNext + fPrev) - fMax*2.0f;
    fAxis = fB / fA;

    // Store the result
    fResult[iBinNum] = iMaxIndex + fAxis;
}
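
For readers following the curve-fit step: the arithmetic in the kernel looks like a three-point parabolic interpolation. That is an assumption read off the code, not something stated explicitly. Writing p, m, n for fPrev, fMax, fNext (symbols introduced here for illustration) and placing the maximum at x = 0, the vertex offset works out as:

```latex
\begin{aligned}
y(x) &= a x^2 + b x + c, \qquad y(-1) = p,\quad y(0) = m,\quad y(1) = n \\
c &= m, \qquad a = \tfrac{1}{2}(p + n) - m, \qquad b = \tfrac{1}{2}(n - p) \\
x^\ast &= -\frac{b}{2a} = \frac{\tfrac{1}{2}(p - n)}{(p + n) - 2m}
\end{aligned}
```

The numerator corresponds to fB and the denominator to fA, so fAxis = fB / fA is the sub-sample offset added to iMaxIndex.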

Visual Profiler also indicates that there is 135% global memory instruction replay for Kernel 2. I have a version of the max search which does not use an if-else statement, but it runs even slower on both GPUs.

Any help will be greatly appreciated.

Upvotes: 0

Views: 666

Answers (2)

Eric Bainville

Reputation: 9906

In your code, thread T accesses fCorrelationResult[T*iNumAngles+iAngle], so on each loop iteration neighbouring threads read addresses that are iNumAngles floats apart. That means you have no coalesced accesses, and probably memory bank conflicts too. Bank conflicts may explain the phenomenon you observe.

You should transpose your matrix, and access fCorrelationResult[T+iAngle*iTotalBins] instead. You will certainly get a nice speedup, and probably more consistent benchmarks between the two GPUs.
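
A minimal CPU-side sketch of that transposed indexing, in plain C so it can be checked without a GPU. It assumes the buffer is re-laid out so that element (bin, angle) sits at angle*iTotalBins + bin; the function names `transpose_bins_angles` and `max_curve_fit_transposed` are hypothetical, and the second one emulates a single work-item rather than being the actual kernel.

```c
#include <stddef.h>

/* Hypothetical helper: copy a row-major [iTotalBins][iNumAngles] buffer
 * into a transposed [iNumAngles][iTotalBins] buffer. */
void transpose_bins_angles(const float *src, float *dst,
                           int iNumAngles, int iTotalBins)
{
    for (int iBin = 0; iBin < iTotalBins; iBin++)
        for (int iAngle = 0; iAngle < iNumAngles; iAngle++)
            dst[(size_t)iAngle * iTotalBins + iBin] =
                src[(size_t)iBin * iNumAngles + iAngle];
}

/* Emulates one work-item (iBinNum plays the role of get_global_id(0)).
 * Reads use iBinNum + iAngle*iTotalBins, so on a GPU work-items with
 * consecutive IDs would touch consecutive addresses on each iteration. */
float max_curve_fit_transposed(const float *fCorrT, int iNumAngles,
                               int iTotalBins, int iBinNum)
{
    /* Find the max for this bin (same logic as the original kernel) */
    float fMax = 0.0f;
    int iMaxIndex = 0;
    for (int iAngle = 0; iAngle < iNumAngles; iAngle++) {
        float fVal = fCorrT[(size_t)iAngle * iTotalBins + iBinNum];
        if (fMax < fVal) {
            fMax = fVal;
            iMaxIndex = iAngle;
        }
    }

    /* Neighbours of the max, wrapping around the ends of the row */
    int iPrev = (iMaxIndex + iNumAngles - 1) % iNumAngles;
    int iNext = (iMaxIndex + 1) % iNumAngles;
    float fPrev = fCorrT[(size_t)iPrev * iTotalBins + iBinNum];
    float fNext = fCorrT[(size_t)iNext * iTotalBins + iBinNum];

    /* Same curve fit as the original kernel */
    float fB = (fPrev - fNext) * 0.5f;
    float fA = (fNext + fPrev) - fMax * 2.0f;
    return (float)iMaxIndex + fB / fA;
}
```

In the kernel itself the equivalent change would be to replace every `iOffset + index` with `iBinNum + index*iTotalBins`, provided the first kernel (or the host) writes the data in the transposed layout to begin with. Doing a separate transpose pass on the GPU would cost extra memory traffic of its own.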

Upvotes: 3

Dmitry Leskov

Reputation: 3165

With OpenCL, is it possible to go to a lower level and find out something about register and shared memory usage when the kernel is run on a particular GPU?

From my limited exposure to NVIDIA CUDA, utilization may be key here. GT240 is compute capability 1.2 and GTX480 is 2.0, so the latter has 2x the registers and 3x shared memory. My guess is that the code produced by OpenCL for the second kernel fails to utilize these resources on the 480. There may be a shared memory bank conflict, for instance.

Upvotes: 0
