Reputation: 3153
I have a CUDA kernel in which every thread traverses a tree. Because of this I have a while loop that runs until the thread reaches a leaf. At every step down the tree it checks which of the children it should follow.
The code is as follows:
__global__ void search(float* centroids, float* features, int featureCount, int *votes)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < featureCount)
    {
        int index = 0;
        while (index < N)
        {
            votes[tid] = index;
            int childIndex = index * CHILDREN + 1;
            float minValue = FLT_MAX;
            if(childIndex >= (N-CHILDREN)) break;
            for(int i = 0; i < CHILDREN; i++)
            {
                int centroidIndex = childIndex + i;
                float value = distance(centroids, features, centroidIndex, tid);
                if(value < minValue)
                {
                    minValue = value;
                    index = childIndex + i;
                }
            }
        }
        tid += blockDim.x * gridDim.x;
    }
}
__device__ float distance(float* a, float* b, int aIndex, int bIndex)
{
    float sum = 0.0f;
    for(int i = 0; i < FEATURESIZE; i++)
    {
        float val = a[aIndex + i] - b[bIndex + i];
        sum += val * val;
    }
    return sum;
}
This code goes into an infinite loop, which is what I find weird. If I change the distance method to return a constant, it works (i.e., it always traverses left in the tree).
Have I missed something about loops in CUDA, or is there just some hidden bug I can't see? I don't see how this code can go into an infinite loop.
Upvotes: 0
Views: 2092
Reputation: 11406
Loops in CUDA C++ have the same semantics as they do in C++, so there must be a bug somewhere in your code. One strategy for debugging it would be to do so on the host.
First, because your code is scalar (e.g., it contains no calls to __syncthreads), you can refactor it into __host__ __device__ functions. distance contains no CUDA-specific identifiers or functions, so you can simply prepend __host__:
__host__ __device__ float distance(float* a, float* b, int aIndex, int bIndex);
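The full definition is unchanged apart from the added qualifier; the body is the same squared-distance loop from your question, so it compiles for both host and device as-is:
__host__ __device__ float distance(float* a, float* b, int aIndex, int bIndex)
{
    float sum = 0.0f;
    // Plain squared Euclidean distance over FEATURESIZE components;
    // nothing here is device-specific.
    for(int i = 0; i < FEATURESIZE; i++)
    {
        float val = a[aIndex + i] - b[bIndex + i];
        sum += val * val;
    }
    return sum;
}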
To refactor your search function, hoist tid (which depends on the CUDA-specific identifier threadIdx et al.) out of it into a parameter, and make it a __host__ __device__ function:
__host__ __device__ void search(int tid, float* centroids, float* features, int featureCount, int *votes)
{
    if(tid < featureCount)
    {
        int index = 0;
        while (index < N)
        {
            votes[tid] = index;
            int childIndex = index * CHILDREN + 1;
            float minValue = FLT_MAX;
            if(childIndex >= (N-CHILDREN)) break;
            for(int i = 0; i < CHILDREN; i++)
            {
                int centroidIndex = childIndex + i;
                float value = distance(centroids, features, centroidIndex, tid);
                if(value < minValue)
                {
                    minValue = value;
                    index = childIndex + i;
                }
            }
        }
    }
}
Now write a __global__ function which does nothing except calculate tid and call search:
__global__ void search_kernel(float *centroids, float *features, int featureCount, int *votes)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    search(tid, centroids, features, featureCount, votes);
}
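On the device you would launch this wrapper exactly as you launched the original kernel; for example (the launch configuration below is an assumption, substitute whatever you used originally):
// Assumed block size -- not taken from the original code.
int threadsPerBlock = 256;
int blocks = (featureCount + threadsPerBlock - 1) / threadsPerBlock;
search_kernel<<<blocks, threadsPerBlock>>>(centroids, features, featureCount, votes);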
Because search is now __host__ __device__, you can debug it by calling it from the CPU, emulating what a kernel launch would do:
for(int tid = 0; tid < featureCount; ++tid)
{
    search(tid, centroids, features, featureCount, votes);
}
It should hang on the host exactly as it would on the device. Stick a printf inside to find out where. Of course, you need to be sure to make host-side copies of your arrays such as centroids, because the host cannot dereference pointers to device memory.
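A minimal sketch of what that might look like; centroidCount is a hypothetical name for however many floats your centroids array actually holds, and the sizes of features and votes are inferred from your snippet:
// Allocate host-side mirrors and copy the device data back.
float *h_centroids = (float*)malloc(centroidCount * sizeof(float));
float *h_features  = (float*)malloc(featureCount * FEATURESIZE * sizeof(float));
int   *h_votes     = (int*)malloc(featureCount * sizeof(int));

cudaMemcpy(h_centroids, centroids, centroidCount * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_features,  features,  featureCount * FEATURESIZE * sizeof(float), cudaMemcpyDeviceToHost);

// Now the host loop can safely dereference these pointers.
for(int tid = 0; tid < featureCount; ++tid)
{
    search(tid, h_centroids, h_features, featureCount, h_votes);
}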
Even though printf is available to use from __device__ functions on newer hardware, the reason you might prefer this approach is that calls to printf from a kernel do not commit until after the kernel retires. If the kernel never retires (as it apparently does not in your case), then your debugging output will never appear on the screen.
Upvotes: 4