Reputation: 3153
I have a CUDA kernel in which every thread traverses a tree. Because of this I have a while loop that runs until the thread reaches a leaf. At every step down the tree it checks which of the children it should follow.
The code is as follows:
__global__ void search(float* centroids, float* features, int featureCount, int *votes)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < featureCount)
    {
        int index = 0;
        while (index < N)
        {
            votes[tid] = index;
            int childIndex = index * CHILDREN + 1;
            float minValue = FLT_MAX;
            if(childIndex >= (N-CHILDREN)) break;
            for(int i = 0; i < CHILDREN; i++)
            {
                int centroidIndex = childIndex + i;
                float value = distance(centroids, features, centroidIndex, tid);
                if(value < minValue)
                {
                    minValue = value;
                    index = childIndex + i;
                }
            }
        }
        tid += blockDim.x * gridDim.x;
    }
}
__device__ float distance(float* a, float* b, int aIndex, int bIndex)
{
    float sum = 0.0f;
    for(int i = 0; i < FEATURESIZE; i++)
    {
        float val = a[aIndex + i] - b[bIndex + i];
        sum += val * val;
    }
    return sum;
}
This code goes into an infinite loop, which is what I find weird. If I change the distance method to return a constant, it works (i.e., it always traverses left in the tree).
Have I missed something about loops in CUDA, or is there just some hidden bug I can't see? I don't see how this code can go into an infinite loop.
Upvotes: 0
Views: 2092
Reputation: 11406
Loops in CUDA C++ have the same semantics as they do in C++, so there must be a bug somewhere in your code. One strategy for debugging it would be to do so on the host.
First, because your code is scalar (e.g., it contains no calls to __syncthreads), you can refactor it into __host__ __device__ functions. distance contains no CUDA-specific identifiers or functions, so you can simply prepend __host__:
__host__ __device__ float distance(float* a, float* b, int aIndex, int bIndex);
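The full definition is unchanged apart from the added qualifier; the body is the same squared-distance loop from your question, so it compiles for both host and device as-is:
__host__ __device__ float distance(float* a, float* b, int aIndex, int bIndex)
{
    float sum = 0.0f;
    // Plain squared Euclidean distance over FEATURESIZE components;
    // nothing here is device-specific.
    for(int i = 0; i < FEATURESIZE; i++)
    {
        float val = a[aIndex + i] - b[bIndex + i];
        sum += val * val;
    }
    return sum;
}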
To refactor your search function, hoist tid (which depends on the CUDA-specific identifier threadIdx et al.) out of it into a parameter, and make it a __host__ __device__ function:
__host__ __device__ void search(int tid, float* centroids, float* features, int featureCount, int *votes)
{
    if(tid < featureCount)
    {
        int index = 0;
        while (index < N)
        {
            votes[tid] = index;
            int childIndex = index * CHILDREN + 1;
            float minValue = FLT_MAX;
            if(childIndex >= (N-CHILDREN)) break;
            for(int i = 0; i < CHILDREN; i++)
            {
                int centroidIndex = childIndex + i;
                float value = distance(centroids, features, centroidIndex, tid);
                if(value < minValue)
                {
                    minValue = value;
                    index = childIndex + i;
                }
            }
        }
    }
}
Now write a __global__ function which does nothing except calculate tid and call search:
__global__ void search_kernel(float *centroids, float *features, int featureCount, int *votes)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    search(tid, centroids, features, featureCount, votes);
}
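On the device you would launch this wrapper exactly as you launched the original kernel; for example (the launch configuration below is an assumption, substitute whatever you used originally):
// Assumed block size -- not taken from the original code.
int threadsPerBlock = 256;
int blocks = (featureCount + threadsPerBlock - 1) / threadsPerBlock;
search_kernel<<<blocks, threadsPerBlock>>>(centroids, features, featureCount, votes);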
Because search is now __host__ __device__, you can debug it by calling it from the CPU, emulating what a kernel launch would do:
for(int tid = 0; tid < featureCount; ++tid)
{
    search(tid, centroids, features, featureCount, votes);
}
It should hang on the host exactly as it would on the device. Stick a printf inside to find out where. Of course, you need to be sure to make host-side copies of your arrays such as centroids, because the host cannot dereference pointers to device memory.
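A minimal sketch of what that might look like; centroidCount is a hypothetical name for however many floats your centroids array actually holds, and the sizes of features and votes are inferred from your snippet:
// Allocate host-side mirrors and copy the device data back.
float *h_centroids = (float*)malloc(centroidCount * sizeof(float));
float *h_features  = (float*)malloc(featureCount * FEATURESIZE * sizeof(float));
int   *h_votes     = (int*)malloc(featureCount * sizeof(int));

cudaMemcpy(h_centroids, centroids, centroidCount * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_features,  features,  featureCount * FEATURESIZE * sizeof(float), cudaMemcpyDeviceToHost);

// Now the host loop can safely dereference these pointers.
for(int tid = 0; tid < featureCount; ++tid)
{
    search(tid, h_centroids, h_features, featureCount, h_votes);
}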
Even though printf is available to use from __device__ functions on newer hardware, the reason you might prefer this approach is that calls to printf from a kernel do not commit until after the kernel retires. If the kernel never retires (as it apparently does not in your case), then your debugging output will never appear on the screen.
Upvotes: 4