Mads Andersen

Reputation: 3153

CUDA: Only one job to begin with

Sorry for the bad title; I could not come up with anything better.

Every example of CUDA programs I have seen has predefined data that is ready to be parallelized. A common example is the sum of two matrices where both matrices are already filled. But what about programs that generate new tasks? How do I model this in CUDA? How do I pass a result on so that other threads can begin working on it?

For example: say I run a kernel on one job. This job generates 10 new independent jobs, each of those generates 10 new independent jobs, and so on. This seems like a highly parallel task because every job is independent. The problem is that I don't know how to model it in CUDA. I tried it in CUDA with a while loop inside a kernel that kept polling whether a thread could begin computation; each thread was assigned one job. But that did not work: the while loop seemed to be ignored.

Code example:

On host:

int *ready, *result;                                   // device arrays of N ints
cudaMalloc(&ready, N * sizeof(int));  cudaMalloc(&result, N * sizeof(int));
cudaMemset(ready, 0, N * sizeof(int));                 // fill ready array with 0
int one = 1;  cudaMemcpy(ready, &one, sizeof(int), cudaMemcpyHostToDevice);  // ready[0] = 1

On device:
__global__ void kernel(int *ready, int *result)
{
    int tid = threadIdx.x;
    if (tid < N)
    {
        // Spin until this thread's job is marked ready. Note: ready is not
        // declared volatile, so the compiler may cache ready[tid] in a
        // register and never re-read it inside the loop.
        while (ready[tid] != 1)
        {
            // busy-wait
        }

        result[tid] = 3; // later do real computation

        // This job's 10 children are now ready to work.
        int childIndex = tid * 10;
        if (childIndex < N - 10)
        {
            for (int i = 1; i <= 10; ++i)
                ready[childIndex + i] = 1;
        }
    }
}

Upvotes: 0

Views: 405

Answers (2)

MrDor

Reputation: 108

It seems like you can use CUDA Dynamic Parallelism.

With it you can launch a kernel from inside another kernel. So when the first kernel has done its work and generated the 10 tasks, it can, right before it finishes, launch the next kernel to handle those tasks.
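
A minimal sketch of that idea for the tree of jobs in the question (doJobs and firstIndex are illustrative names, not from the question; dynamic parallelism assumes compute capability 3.5+ and compiling with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

__global__ void doJobs(int *result, int firstIndex)
{
    int index = firstIndex + threadIdx.x;       // one thread per job
    result[index] = 3;                          // later do real computation

    int childIndex = index * 10 + 1;            // this job's 10 children
    if (childIndex + 9 < N)                     // N: total job count, as in the question
        doJobs<<<1, 10>>>(result, childIndex);  // launch the children from the device
}

// started from the host with the single root job:
doJobs<<<1, 1>>>(result, 0);

Each launch here is tiny (10 threads), so a real implementation would batch many jobs per launch; this only shows the mechanism.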

Upvotes: 2

onit

Reputation: 6366

You will want to use multiple kernel calls. Once a kernel job has finished and generated the work units for its children, the children can be executed in another kernel launch. You don't want to poll with a while loop inside a CUDA kernel anyway; even if it worked, you would get terrible performance.

I would google the CUDA parallel reduction example; it shows how to decompose a problem into multiple kernel launches. The only difference is that instead of doing less work between kernels, you will be doing more.
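
A rough sketch of that pattern (processJobs, d_result, and numGenerations are illustrative names, not from the question or the reduction sample): generation g holds 10^g jobs, so each launch simply covers ten times more work than the previous one:

__global__ void processJobs(int *result, int offset, int numJobs)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < numJobs)
        result[offset + tid] = 3;           // later do real computation
}

// Host side: each launch consumes the jobs the previous launch generated.
// Launches on the same stream run in order, so no sync is needed in between.
int offset = 0, jobs = 1;
for (int gen = 0; gen < numGenerations; ++gen)
{
    processJobs<<<(jobs + 255) / 256, 256>>>(d_result, offset, jobs);
    offset += jobs;
    jobs *= 10;                             // each job spawns 10 children
}
cudaDeviceSynchronize();                    // wait before reading results back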

Upvotes: 8
