Jean-Luc Nacif Coelho

Reputation: 1022

CUDA: how to trick the kernel into thinking it's in another thread?

I'm writing a wrapper for CUDA kernel functions that manages thread allocation so as to hide the GPU's thread limits. Because CUDA caps the number of threads in a launch, the user would normally have to write code to manage the threads. I want to hide that limit so the user can run a kernel on an arbitrary number of threads.

The basic idea is this:

void launch_cuda_kernel_matrix(void (*func)(void*), void* param, unsigned int dim_x, unsigned int dim_y) {
    while (! all threads run) {
        do stuff ...
        fake_func<<<max_x, max_y>>>(func, param, current_run);
    }
}
__global__ void fake_func(void (*func)(void*), void* param, unsigned int current_run) {
    // pseudocode: remap the built-in indices for this run (the part CUDA rejects)
    blockIdx.x = blockIdx.x (some math) current_run;
    threadIdx.x = threadIdx.x (some math) current_run;
    func(param);
}

My plan is to trick the kernel by changing the thread and block indices of the current thread, and then to call the user's function from my wrapper, launched with the maximum number of threads available. (I will eventually generalize the design to support multiple dimensions.)

The problem is that CUDA does not allow me to change the thread and block indices. Is there a way around that?

Also, what is the best way to pass parameters to func without having to resort to void*?

Upvotes: 0

Views: 224

Answers (1)

Nikolay K

Reputation: 3850

Well, I think it is difficult to achieve your goal in general. However, from your question I conclude that your function func has no data dependencies between threads (every thread processes its own part and does not interact with other threads). Also suppose func deals with 1 dimension (or maybe 2). In that case you can simply launch a huge number of threads in CUDA, which will be enough in most cases:

  • compute capability 1.x - 65535 x 65535 blocks x 512 threads per block
  • compute capability 2.0+ - 65535 x 65535 x 65535 blocks x 1024 threads per block (from 3.0 the x grid dimension grows to 2^31 - 1)
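
To make the limits above concrete, here is a minimal sketch of the launch arithmetic for a 1D problem. It assumes 1024 threads per block and at most 65535 blocks in the x dimension; the names blocks_for, fits_in_one_launch and the two constants are made up for illustration and are not part of the CUDA API.

```cpp
#include <cassert>
#include <cstdint>

// Assumed limits (1024 threads per block, 65535 blocks in x). On a real
// device, read maxThreadsPerBlock and maxGridSize from cudaGetDeviceProperties
// instead of hard-coding them.
constexpr std::uint64_t kThreadsPerBlock = 1024;
constexpr std::uint64_t kMaxBlocksX = 65535;

// Ceiling division: the smallest block count with
// blocks * threads_per_block >= n.
inline std::uint64_t blocks_for(std::uint64_t n, std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    return n / threads_per_block + (n % threads_per_block != 0 ? 1 : 0);
}

// True if n logical threads fit in a single 1D launch under the assumed limits.
inline bool fits_in_one_launch(std::uint64_t n) {
    return blocks_for(n, kThreadsPerBlock) <= kMaxBlocksX;
}
```

Under these assumptions a single 1D launch covers a little over 67 million threads, so a wrapper only needs several launches for problems larger than that.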

Another way is to change the signature of func to void (*func)(int i, void*), so that the function processes the i-th part of the data and the wrapper computes i. For multiple dimensions you can change the signature to void (*func)(int i, int j, int k, void*). In my opinion this is preferable, since a __device__ function can also be declared __host__, which lets you run the same function in parallel on the CPU.
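
A rough sketch of the index-passing idea, under a few assumptions: square_element is a hypothetical user function, index_kernel, launch_on_gpu and run_on_cpu are illustrative names, and the user function is called directly rather than through a function pointer (passing a device function pointer from host code needs extra care). The CUDA parts are guarded by __CUDACC__, so the same file also builds with a plain C++ compiler.

```cpp
#include <cassert>

// Under nvcc, HD marks a function as callable from both host and device;
// under a plain C++ compiler it expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical user function: processes the i-th element and never reads
// threadIdx or blockIdx itself.
HD inline void square_element(int i, void* param) {
    float* data = static_cast<float*>(param);
    data[i] = data[i] * data[i];
}

#ifdef __CUDACC__
// Each thread derives its logical index from the built-in indices plus the
// offset of the current launch, then passes it to the user function.
__global__ void index_kernel(long long offset, int n, void* param) {
    long long i = offset + 1LL * blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) square_element(static_cast<int>(i), param);
}

// Covers n logical threads with as many launches as needed.
// device_param must point to device memory.
inline void launch_on_gpu(int n, void* device_param,
                          int max_blocks, int threads_per_block) {
    assert(max_blocks > 0 && threads_per_block > 0);
    const long long per_launch = 1LL * max_blocks * threads_per_block;
    for (long long offset = 0; offset < n; offset += per_launch) {
        index_kernel<<<max_blocks, threads_per_block>>>(offset, n, device_param);
    }
}
#endif

// CPU driver: the same user function, called once per logical index.
inline void run_on_cpu(int n, void* param) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) square_element(i, param);
}
```

Because the user function only sees i, the wrapper is free to split the work across launches however it likes, and run_on_cpu gives a reference result to compare against.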

As for the void* problem, I would recommend C++ templates (plus variadic templates); in plain C, void* is fine.
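
One possible shape for the template approach, as a sketch only: Scale is a hypothetical functor, for_each_index_kernel and for_each_index_cpu are illustrative names, and the kernel assumes a toolkit with C++11 support (variadic kernel templates need CUDA 7 or later, as far as I know). The arguments keep their real types, so no void* cast is needed.

```cpp
#include <cassert>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical functor: takes the logical index plus typed arguments.
struct Scale {
    HD void operator()(int i, float* data, float factor) const {
        data[i] *= factor;
    }
};

#ifdef __CUDACC__
// The functor type and argument types are template parameters, so the call
// is resolved at compile time. Arguments are passed by value, so any
// pointers among them must refer to device memory.
template <typename F, typename... Args>
__global__ void for_each_index_kernel(int n, F f, Args... args) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i, args...);
}
#endif

// Host fallback with the same shape, handy for checking results on the CPU.
template <typename F, typename... Args>
void for_each_index_cpu(int n, F f, Args... args) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) f(i, args...);
}
```

On the host, for_each_index_cpu(3, Scale{}, data, 2.0f) multiplies the first three elements of data by the factor; the device version would be launched as for_each_index_kernel<<<blocks, threads>>>(n, Scale{}, device_data, 2.0f).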

Upvotes: 1
