Random NaN and incorrect results with OpenCL kernel

Question

I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.

The Kernel

I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:

__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                   __global const float* A, __global const float* B, const float beta, 
                   __global float* C, __local float* Asub, __local float* Bsub) {

  const uint row = get_local_id(0);
  const uint col = get_local_id(1);
  const uint TS = get_local_size(0); // Tile size
  const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
  const uint globalCol = TS * get_group_id(1) + col; // Row ID of C (0..N)

  // Initialise the accumulation register
  float acc = 0.0f;

  // Loop over all tiles
  const int numtiles = K / TS;
  for (int t = 0; t < numtiles; t++) {
    const int tiledRow = TS * t + row;
    const int tiledCol = TS * t + col;
    Asub[col * TS + row] = A[tiledCol * M + globalRow];
    Bsub[col * TS + row] = B[globalCol * K + tiledRow];

    barrier(CLK_LOCAL_MEM_FENCE);

    for(int k = 0; k < TS; k++) {
      acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
  }

  C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}

Tile Size (TS) is now a value defined in the calling code, which looks like this:

  // A, B and C are 2D matrices, their cl::Buffers have already been set up
  // and values appropriately set.

  kernel.setArg(0, (cl_int)nrowA);
  kernel.setArg(1, (cl_int)ncolB);
  kernel.setArg(2, (cl_int)ncolA);
  kernel.setArg(3, alpha);
  kernel.setArg(4, A_buffer);
  kernel.setArg(5, B_buffer);
  kernel.setArg(6, beta);
  kernel.setArg(7, C_buffer);
  kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
  kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));

  cl::NDRange global(nrowA, ncolB);
  cl::NDRange local(nrowA, ncolB);

  status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);

The Problem

The problem I am encountering is, unit tests (written with Google's gtest) I have written will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time)

I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.

However, I can get this correct result maybe just 75% of the time.

Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.

I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.

What am I doing wrong here?

Things I have tried

I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.

Other points to note

I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.

Random NaN and incorrect results with OpenCL kernel

The Kernel

The Problem

Things I have tried

Other points to note

Answers (1)

Initializing local memory

Your changes

Update 1

Related Questions