Reputation: 1146
I originally wrote an OpenCL program to calculate very large Hermitian matrices, where each invocation of the kernel calculates a single pair of entries in the matrix (an entry in the upper triangular portion and its lower triangular complement).
Very early on, I found a very odd problem: if my kernel size is exactly 55, the 27th kernel thread does not execute. This problem only occurs when using the nVidia driver with GPU acceleration. When I run it using the Intel driver on the CPU, the 27th kernel thread executes just fine. Larger and smaller kernel sizes don't seem to exhibit the problem.
Thinking it might be something in my code, I distilled my problem down to the following very simple kernel:
__kernel void testIndex(__global float* outMatrix, unsigned int sizeN)
{
    //k is the linear kernel ID (related to but not exactly the linear index into the outMatrix)
    int k = get_global_id(0);
    //i'th index (Row or Y)
    int i = floor((2 * sizeN+1 - sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k) )) /2);
    //j'th index (Column or X)
    int j = k - sizeN * i + i * (i - 1) / 2;
    j += i;
    //Index bounds check... If we're greater than sizeN, we're an idle core.
    //(OpenCL will queue up a fixed block size of worker threads, some of them may be out of bounds)
    if(j >= sizeN || i >= sizeN)
    {
        return;
    }
    //Identity case. The original kernel did some special stuff here,
    //but I've just replaced it with the K index code.
    if(i == j)
    {
        outMatrix[i * sizeN + j] = k;
        return;
    }
    outMatrix[i * sizeN + j] = k;
    //Since we only have to calculate the upper triangle of our matrix,
    //(the lower triangle is just the complement of the upper),
    //this test sets the lower triangle to -9999 so it's easier to see
    //how the indexing plays out...
    outMatrix[j * sizeN + i] = -9999.0;
}
outMatrix is the output matrix, and sizeN is the size of the square matrix on a side (i.e. the matrix is sizeN x sizeN).
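For reference, the i formula above just inverts the triangular packing k = i * sizeN - i * (i - 1) / 2 + (j - i). A minimal host-side sanity check of that mapping (plain C++, no OpenCL, evaluated in double precision) looks something like this:
// Host-side sanity check of the k -> (i, j) mapping used by the kernel.
#include <cassert>
#include <cmath>
#include <cstdio>

int main()
{
    const int sizeN = 10;                             // same as elems below
    const int kernelSize = sizeN * (sizeN + 1) / 2;   // 55 work-items

    for (int k = 0; k < kernelSize; ++k)
    {
        // Same formulas as the kernel, but evaluated in double precision.
        int i = (int)std::floor((2 * sizeN + 1
                 - std::sqrt((double)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k))) / 2);
        int j = k - sizeN * i + i * (i - 1) / 2 + i;

        // Every k should land on the upper triangle (diagonal included).
        assert(i >= 0 && i <= j && j < sizeN);
        std::printf("k=%2d -> (i=%d, j=%d)\n", k, i, j);
    }
    return 0;
}
On the CPU this maps k = 27 to (i, j) = (3, 3), the diagonal entry that comes back as junk below.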
I calculate the kernel size and enqueue the kernel using the following host code:
size_t kernelSize = elems * (elems + 1) / 2;   //one work-item per upper-triangular entry
cl::NDRange globalRange(kernelSize);
cl::NDRange localRange(1);                     //declared, but cl::NullRange is what actually gets passed below
cl::Event event;
clCommandQueue.enqueueNDRangeKernel(testKernel, cl::NullRange, globalRange, cl::NullRange, NULL, &event);
event.wait();
elems is the same as sizeN (i.e. the square root of the matrix size). In this case, elems = 10 (thus giving a kernel size of 55).
If I print out the matrix that I read back, I get the following (using boost ublas matrix formatting):
[10,10] (( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
((-9999, 10, 11, 12, 13, 14, 15, 16, 17, 18),
((-9999, -9999, 19, 20, 21, 22, 23, 24, 25, 26),
((-9999, -9999, -9999, JUNK, 28, 29, 30, 31, 32, 33),
((-9999, -9999, -9999, -9999, 34, 35, 36, 37, 38, 39),
((-9999, -9999, -9999, -9999, -9999, 40, 41, 42, 43, 44),
((-9999, -9999, -9999, -9999, -9999, -9999, 45, 46, 47, 48),
((-9999, -9999, -9999, -9999, -9999, -9999, -9999, 49, 50, 51),
((-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, 52, 53),
((-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, 54))
Where "JUNK" is a random value based on whatever happens to be in that memory at the time. This is of course suspicious, as 27 is is basically the exact halfway point in the kernel.
Just for completeness, the matrix result is read back using the following code:
boost::scoped_array<float> outMatrixReadback(new float[elems * elems]);
//Blocking read (CL_TRUE) of the full elems x elems matrix back to the host
clCommandQueue.enqueueReadBuffer(clOutputMatrixBuffer, CL_TRUE, 0, elems * elems * sizeof(float), outMatrixReadback.get());
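(The actual printing code isn't shown here; something along these lines, using a hypothetical helper around boost ublas, produces the formatting shown above:)
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

// Hypothetical helper; copies the readback buffer into a ublas matrix so that
// operator<< prints it in the [10,10]((...),(...),...) style shown above.
void printMatrix(const float* data, std::size_t elems)
{
    boost::numeric::ublas::matrix<float> m(elems, elems);
    for (std::size_t r = 0; r < elems; ++r)
        for (std::size_t c = 0; c < elems; ++c)
            m(r, c) = data[r * elems + c];
    std::cout << m << std::endl;
}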
I am making the (perhaps incorrect) assumption that since the code executes fine on an Intel CPU, there is not some fundamental bug in the code itself.
So then, is there perhaps some gotcha I'm not aware of when programming OpenCL on an nVidia card, or am I unfortunate enough to have found a driver bug?
Hardware/OS specs
nVidia GTX 770
RHEL Server release 6.4 (Santiago)
Intel OpenCL 1.2 4.4.4.0.134 SDK headers
nVidia GeForce driver 384.69
Intel Xeon CPU E6520 @ 2.4 GHz
Upvotes: 16
Views: 1178
Reputation: 1
Here are NVIDIA's replies: one is a workaround and one is a solution. We posted them in our bug system but did not get a reply from you, so we are posting the workaround/solution here. Thank you!
1. Workaround: We were able to reproduce this issue locally, and our developer team came up with a workaround. Please try the modification below and let us know whether it works. Thank you for your patience.
Here is a workaround you can try: change the sqrt() call to pow() in the file testIndex.cl; see the snippet below.
//i'th index (Row or Y)
// original version -- FAILS with sqrt()
//int i = floor((2 * sizeN+1 - sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k) )) /2);
// workaround -- PASSES with pow()
int i = floor((2 * sizeN+1 - pow((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k), 0.5f)) /2);
2. Solution: There is a new solution for the problem today. Please review the description and method below, and let us know whether it resolves the problem. Thank you.
The OpenCL 1.2 spec, section 5.6.4.2, describes -cl-fp32-correctly-rounded-divide-sqrt: "The -cl-fp32-correctly-rounded-divide-sqrt build option to clBuildProgram or clCompileProgram allows an application to specify that single precision floating-point divide (x/y and 1/x) and sqrt used in the program source are correctly rounded. If this build option is not specified, the minimum numerical accuracy of single precision floating-point divide and sqrt are as defined in section 7.4 of the OpenCL specification."
In Section 7.4, the table says: sqrt <= 3 ulp
The two values produced here, root = 15.0000009537 and root = 15.0000000000, differ by only 1 ULP, which is allowed by the standard. See https://en.wikipedia.org/wiki/Unit_in_the_last_place for an introduction to ULPs. You essentially need to use the "-cl-fp32-correctly-rounded-divide-sqrt" option when compiling the OpenCL program, by specifying program.build(devices, "-cl-fp32-correctly-rounded-divide-sqrt");
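To connect this to the missing entry: for k = 27 and sizeN = 10 the argument to sqrt() is (2*10+1)^2 - 8*27 = 225, whose exact root is 15. A root just 1 ULP high pushes the floor() from 3 down to 2, and the resulting (i, j) fails the kernel's bounds check, so that work-item returns without writing anything. A minimal host-side sketch (plain C++, reusing the kernel's formulas and the two root values quoted above) shows the effect:
// How a 1 ULP error in sqrt(225.0f) flips the row index for k = 27, sizeN = 10.
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main()
{
    const int sizeN = 10, k = 27;

    // The correctly rounded root of 225 is exactly 15.0f; the 1-ULP-high value
    // 15.0000009537f is the larger of the two roots quoted above.
    const float exactRoot = std::sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k));
    const float highRoot  = std::nextafter(15.0f, 16.0f);

    for (float root : {exactRoot, highRoot})
    {
        int i = (int)std::floor((2 * sizeN + 1 - root) / 2);
        int j = k - sizeN * i + i * (i - 1) / 2 + i;
        std::printf("root = %.10f -> i = %d, j = %d%s\n", root, i, j,
                    (j >= sizeN || i >= sizeN) ? "  (out of bounds, work-item returns)" : "");
    }
    return 0;
}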
Upvotes: 0
Reputation: 1146
After discussions with nVidia, a technical rep confirmed this to be both repeatable and a driver bug. A bug report was submitted, but unfortunately I was informed that nVidia doesn't have a dedicated OpenCL dev team, so no timeline for a fix can be provided.
Edit: After finally hearing back from nVidia, the workaround is to use pow() instead of sqrt() in the CL kernel, as sqrt() is apparently the source of the bug.
Upvotes: 2