Reputation: 1146
I originally wrote an OpenCL program to calculate very large Hermitian matrices, where each invocation of the kernel calculates a single pair of entries in the matrix (an entry in the upper triangular portion and its lower triangular complement).
Very early on, I found a very odd problem: if my kernel size is exactly 55, the 27th kernel thread does not execute. This problem only occurs when using the nVidia driver with GPU acceleration. When I run it using the Intel driver on the CPU, the 27th kernel thread executes just fine. Larger and smaller kernel sizes don't seem to exhibit the problem.
Thinking it might be something in my code, I distilled my problem down to the following very simple kernel:
__kernel void testIndex(__global float* outMatrix, unsigned int sizeN)
{
    //k is the linear kernel ID (related to but not exactly the linear index into the outMatrix)
    int k = get_global_id(0);
    //i'th index (Row or Y)
    int i = floor((2 * sizeN+1 - sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k) )) /2);
    //j'th index (Column or X)
    int j = k - sizeN * i + i * (i - 1) / 2;
    j += i;
    //Index bounds check... If we're greater than sizeN, we're an idle core.
    //(OpenCL will queue up a fixed block size of worker threads, some of them may be out of bounds)
    if(j >= sizeN || i >= sizeN)
    {
        return;
    }
    //Identity case. The original kernel did some special stuff here,
    //but I've just replaced it with the K index code.
    if(i == j)
    {
        outMatrix[i * sizeN + j] = k;
        return;
    }
    outMatrix[i * sizeN + j] = k;
    //Since we only have to calculate the upper triangle of our matrix,
    //(the lower triangle is just the complement of the upper),
    //this test sets the lower triangle to -9999 so it's easier to see
    //how the indexing plays out...
    outMatrix[j * sizeN + i] = -9999.0;
}
outMatrix is the output matrix, and sizeN is the size of the square matrix on a side (i.e. the matrix is sizeN x sizeN).
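For reference, the i formula above just inverts the triangular packing k = i * sizeN - i * (i - 1) / 2 + (j - i). A minimal host-side sanity check of that mapping (plain C++, no OpenCL, evaluated in double precision) looks something like this:
// Host-side sanity check of the k -> (i, j) mapping used by the kernel.
#include <cassert>
#include <cmath>
#include <cstdio>

int main()
{
    const int sizeN = 10;                             // same as elems below
    const int kernelSize = sizeN * (sizeN + 1) / 2;   // 55 work-items

    for (int k = 0; k < kernelSize; ++k)
    {
        // Same formulas as the kernel, but evaluated in double precision.
        int i = (int)std::floor((2 * sizeN + 1
                 - std::sqrt((double)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k))) / 2);
        int j = k - sizeN * i + i * (i - 1) / 2 + i;

        // Every k should land on the upper triangle (diagonal included).
        assert(i >= 0 && i <= j && j < sizeN);
        std::printf("k=%2d -> (i=%d, j=%d)\n", k, i, j);
    }
    return 0;
}
On the CPU this maps k = 27 to (i, j) = (3, 3), the diagonal entry that comes back as junk below.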
I calculate the kernel size and enqueue the kernel using the following host code:
size_t kernelSize = elems * (elems + 1) / 2;   //one work-item per upper-triangular entry
cl::NDRange globalRange(kernelSize);
cl::NDRange localRange(1);                     //declared, but cl::NullRange is what actually gets passed below
cl::Event event;
clCommandQueue.enqueueNDRangeKernel(testKernel, cl::NullRange, globalRange, cl::NullRange, NULL, &event);
event.wait();
elems is the same as sizeN (i.e. the square root of the matrix size). In this case, elems = 10 (thus giving a kernel size of 55).
If I print out the matrix that I read back, I get the following (using boost ublas matrix formatting):
[10,10] (( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
((-9999, 10, 11, 12, 13, 14, 15, 16, 17, 18),
((-9999, -9999, 19, 20, 21, 22, 23, 24, 25, 26),
((-9999, -9999, -9999, JUNK, 28, 29, 30, 31, 32, 33),
((-9999, -9999, -9999, -9999, 34, 35, 36, 37, 38, 39),
((-9999, -9999, -9999, -9999, -9999, 40, 41, 42, 43, 44),
((-9999, -9999, -9999, -9999, -9999, -9999, 45, 46, 47, 48),
((-9999, -9999, -9999, -9999, -9999, -9999, -9999, 49, 50, 51),
((-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, 52, 53),
((-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, 54))
Where "JUNK" is a random value based on whatever happens to be in that memory at the time. This is of course suspicious, as 27 is is basically the exact halfway point in the kernel.
Just for completeness, the matrix result is read back using the following code:
boost::scoped_array<float> outMatrixReadback(new float[elems * elems]);
//Blocking read (CL_TRUE) of the full elems x elems matrix back to the host
clCommandQueue.enqueueReadBuffer(clOutputMatrixBuffer, CL_TRUE, 0, elems * elems * sizeof(float), outMatrixReadback.get());
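(The actual printing code isn't shown here; something along these lines, using a hypothetical helper around boost ublas, produces the formatting shown above:)
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

// Hypothetical helper; copies the readback buffer into a ublas matrix so that
// operator<< prints it in the [10,10]((...),(...),...) style shown above.
void printMatrix(const float* data, std::size_t elems)
{
    boost::numeric::ublas::matrix<float> m(elems, elems);
    for (std::size_t r = 0; r < elems; ++r)
        for (std::size_t c = 0; c < elems; ++c)
            m(r, c) = data[r * elems + c];
    std::cout << m << std::endl;
}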
I am making the (perhaps incorrect) assumption that since the code executes fine on an Intel CPU, there is not some fundamental bug in the code itself.
So then, is there perhaps some gotcha I'm not aware of when programming OpenCL on an nVidia card, or am I unfortunate enough to have found a driver bug?
Hardware/OS specs
nVidia GTX 770
RHEL Server release 6.4 (Santiago)
Intel OpenCL 1.2 4.4.4.0.134 SDK headers
nVidia GeForce driver 384.69
Intel Xeon CPU E6520 @ 2.4 GHz
Upvotes: 16
Views: 1178
Reputation: 1
Here are NVIDIA's replies: one is a workaround and one is a solution. We posted them in our bug system but did not get a reply from you, so we are posting the workaround/solution here. Thank you!
1. Workaround: We were able to reproduce this issue locally, and our developer team came up with a workaround. Please try the modification below and let us know whether it works. Thank you for your patience.
Here is a workaround you can try: change the sqrt() call to pow() in the file testIndex.cl; see the snippet below.
//i'th index (Row or Y)
// original version -- FAILS with sqrt()
//int i = floor((2 * sizeN+1 - sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k) )) /2);
// workaround -- PASSES with pow()
int i = floor((2 * sizeN+1 - pow((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k), 0.5f)) /2);
2. Solution: There is a new solution for the problem today. Please review the description and method below, and let us know whether it resolves the problem. Thank you.
The OpenCL 1.2 spec, section 5.6.4.2, describes -cl-fp32-correctly-rounded-divide-sqrt: "The -cl-fp32-correctly-rounded-divide-sqrt build option to clBuildProgram or clCompileProgram allows an application to specify that single precision floating-point divide (x/y and 1/x) and sqrt used in the program source are correctly rounded. If this build option is not specified, the minimum numerical accuracy of single precision floating-point divide and sqrt are as defined in section 7.4 of the OpenCL specification."
In Section 7.4, the table says: sqrt <= 3 ulp
The two values produced here, root = 15.0000009537 and root = 15.0000000000, differ by only 1 ULP, which is allowed by the standard. See https://en.wikipedia.org/wiki/Unit_in_the_last_place for an introduction to ULPs. You essentially need to use the "-cl-fp32-correctly-rounded-divide-sqrt" option when compiling the OpenCL program, by specifying program.build(devices, "-cl-fp32-correctly-rounded-divide-sqrt");
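To connect this to the missing entry: for k = 27 and sizeN = 10 the argument to sqrt() is (2*10+1)^2 - 8*27 = 225, whose exact root is 15. A root just 1 ULP high pushes the floor() from 3 down to 2, and the resulting (i, j) fails the kernel's bounds check, so that work-item returns without writing anything. A minimal host-side sketch (plain C++, reusing the kernel's formulas and the two root values quoted above) shows the effect:
// How a 1 ULP error in sqrt(225.0f) flips the row index for k = 27, sizeN = 10.
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main()
{
    const int sizeN = 10, k = 27;

    // The correctly rounded root of 225 is exactly 15.0f; the 1-ULP-high value
    // 15.0000009537f is the larger of the two roots quoted above.
    const float exactRoot = std::sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k));
    const float highRoot  = std::nextafter(15.0f, 16.0f);

    for (float root : {exactRoot, highRoot})
    {
        int i = (int)std::floor((2 * sizeN + 1 - root) / 2);
        int j = k - sizeN * i + i * (i - 1) / 2 + i;
        std::printf("root = %.10f -> i = %d, j = %d%s\n", root, i, j,
                    (j >= sizeN || i >= sizeN) ? "  (out of bounds, work-item returns)" : "");
    }
    return 0;
}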
Upvotes: 0
Reputation: 1146
After discussions with nVidia, a technical rep confirmed this to be both repeatable and a driver bug. A bug report was submitted, but unfortunately I was informed that nVidia doesn't have a dedicated OpenCL dev team, so no timeline for a fix can be provided.
Edit: After finally hearing back from nVidia, the workaround is to use pow() instead of sqrt() in the CL kernel, as sqrt() is apparently the source of the bug.
Upvotes: 2