BAdhi

Reputation: 510

Using Lambda functions in CUDA with template functions

I was trying to implement a generic class using CUDA for common algorithms like Reduce or Scan, providing some pre-processing, such as a simple map, inside the algorithm. These map operations are carried out before the actual reduce/scan algorithm. In order to realize this I was hoping to use lambda functions. The following is the way I tried to implement it.

template<typename T> __device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    //Actual Reduce Algorithm Comes here 
}

template<typename T, typename LAMBDA> 
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();

    ReduceOperationPerThread(d_in, d_out, size);
}

The helper function that invokes this kernel is created as follows,

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // preparing block sizes, grid sizes
    // and additional logic for invoking the kernel goes here
    // with the Kernel invocation as following

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}
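
For illustration, with the elided launch configuration filled in naively, the helper might look like this (the 256-thread block size here is just an arbitrary choice for the example, not my actual logic):

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // Example launch configuration: one thread per element,
    // rounded up to whole blocks of an arbitrary size
    const unsigned int blockSize = 256;
    const unsigned int gridSize = (size + blockSize - 1) / blockSize;

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}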

All of the above code is included in a source file named Reduce.cu, and the respective header, Reduce.h, is created as follows:

// Reduce.h
template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam);
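
With this split, a caller in another translation unit would look something like the following (a hypothetical main.cpp; the lambda body is a placeholder):

// main.cpp -- hypothetical caller, not part of my actual code
#include "Reduce.h"

int main()
{
    int *d_in = nullptr, *d_out = nullptr; // assume allocated with cudaMalloc
    unsigned int size = 1024;

    // Every lambda expression has a unique, unnamable type, so this call
    // needs an instantiation of ReduceWPreprocessing<int, /*that type*/>
    // to exist somewhere at link time.
    ReduceWPreprocessing(d_in, d_out, size, [] { /* some preprocessing */ });
    return 0;
}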

So at the end of the day the complete Reduce.cu looks like this,

// Reduce.cu
template<typename T> __device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    //Actual Reduce Algorithm Comes here 
}

template<typename T, typename LAMBDA> 
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();

    ReduceOperationPerThread(d_in, d_out, size);
}

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // preparing block sizes, grid sizes
    // and additional logic for invoking the kernel goes here
    // with the Kernel invocation as following

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}

But the problem I'm having is related to splitting template functions across separate .h and .cu files.

In normal cases, where lambda functions are not used, what I used to do was add explicit instantiations of the function for every possible template parameter value at the end of the .cu file, as mentioned here under the FAQ entry "How can I avoid linker errors with my template classes?"

// At the end of the Reduce.cu file
// Writing explicit instantiations with possible template values
// for a normal Reduce function

template void Reduce<double>(double * d_in, double * d_out, unsigned int size);
template void Reduce<float>(float * d_in, float * d_out, unsigned int size);
template void Reduce<int>(int * d_in, int * d_out, unsigned int size);

But in this case the possible values for the template parameter LAMBDA cannot be predefined.

template void ReduceWPreprocessing<int>(int * d_in, int * d_out, unsigned int size, ??? lambda);

Is there another way to use lambda functions for this kind of application?

Upvotes: 2

Views: 1092

Answers (1)

talonmies

Reputation: 72344

[Summarizing comments into a community wiki answer to get this question off the unanswered queue]

At the time the question was posted, there was no way to do what the question asked, because CUDA lacked the equivalent of a polymorphic function wrapper which could capture lambda expressions.

However, CUDA (as of version 8, released in late 2016) now has a std::function-like polymorphic function wrapper, nvstd::function, provided by the <nvfunctional> header. This allows you to define a generic type for the lambda expression which can be used as a template parameter during instantiation, and then to capture a lambda passed as an argument and call it in a generic fashion.
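
As a minimal sketch of what that looks like (the kernel and names below are invented for illustration, not part of the original answer; it assumes CUDA 8 or later with -std=c++11), note that nvstd::function instances cannot be passed between host and device code, so the wrapper has to be constructed on the side where it is invoked:

#include <nvfunctional>

// Hypothetical kernel wrapping a lambda in nvstd::function inside device code
__global__ void MapKernel(int * d_data, unsigned int size)
{
    // The wrapper is constructed in device code; an nvstd::function
    // object cannot cross the host/device boundary as a kernel argument.
    nvstd::function<int(int)> pre = [](int x) { return 2 * x; };

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        d_data[i] = pre(d_data[i]);
}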

Upvotes: 1
