Cydouzo

Reputation: 525

Pass a __device__ lambda as argument to a __global__ function

Defining __device__ lambdas is quite useful.

I wanted to do the same thing as the code below, but with the lambda defined in a different file from the kernel that uses it.

// Sample code that works
template<typename Func>
__global__ void kernel(Func f){
    f(threadIdx.x);
}

int main(){
    auto f = [] __device__ (int i){ printf("Thread n°%i\n",i); };
    kernel<<<1,16>>>(f);
}

I tried this (not working) implementation.

main.cu

#include "kernelFile.h"

int main(){
    auto f = [] __device__ (int i){ printf("Thread n°%i\n",i); };
    kernelCaller(f);
}

kernelFile.cu

template<typename Func>
__global__ void kernel(Func f){
    f(threadIdx.x);
}

template<typename Func>
__host__ void kernelCaller(Func f){
    kernel<<<1,16>>>(f);
}

But the compiler complains that kernelCaller is never instantiated. I don't know whether it's possible to instantiate it, or whether what I'm trying to do should be implemented differently. Any hint on what I should do?

Upvotes: 0

Views: 637

Answers (1)

Robert Crovella

Reputation: 152123

There is no way to instantiate a templated function unless the desired type is known in the compilation unit where the instantiation takes place. This isn't specific to CUDA.

Therefore, any method will require some knowledge of the desired type in the compilation unit where the kernel function is compiled/instantiated. With that proviso, one possible approach is covered here: we can avoid the type uncertainty associated with a lambda by wrapping it in an nvstd::function object. Have your kernel accept an nvstd::function object (effectively a type wrapper for the lambda), and have your host caller insert the desired lambda into that object.

Here is an example:

$ cat k.cu
#include <nvfunctional>
#include <cstdio>

typedef nvstd::function<int(unsigned)> v;
__global__ void kernel(v *f){

  printf("%d, %d\n", threadIdx.x, (*f)(threadIdx.x));
}

__host__ void kernelCaller(v *f){
  kernel<<<1,2>>>(f);
}

$ cat m.cu
#include <nvfunctional>
// prototype would normally be in a header file
void kernelCaller(nvstd::function<int(unsigned)> *);


template <typename T1, typename T2>
__global__ void inserter(T1 *f, T2 l){
  *f = l;
}


int main(){

  nvstd::function<int(unsigned)> *d_f;
  cudaMalloc(&d_f, sizeof(nvstd::function<int(unsigned)>));
  auto lam1 = [] __device__ (unsigned i) { return i+1;};
  inserter<<<1,1>>>(d_f, lam1);
  kernelCaller(d_f);
  auto lam2 = [] __device__ (unsigned i) { return (i+1)*2;};
  inserter<<<1,1>>>(d_f, lam2);
  kernelCaller(d_f);
  cudaDeviceSynchronize();
}
$ nvcc -o test k.cu m.cu -std=c++11 -expt-extended-lambda -rdc=true
$ cuda-memcheck ./test
========= CUDA-MEMCHECK
0, 1
1, 2
0, 2
1, 4
========= ERROR SUMMARY: 0 errors
$
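For completeness, the prototype mentioned in the comment in m.cu would normally live in a small header shared by both files. A minimal sketch (the file name kernelFile.h is taken from the question; the typedef mirrors the one in k.cu):

```cuda
// kernelFile.h -- hypothetical shared header, matching the question's naming
#pragma once
#include <nvfunctional>

typedef nvstd::function<int(unsigned)> v;

// The kernel and its template machinery stay in k.cu; only this
// fixed-type wrapper needs to be visible to other translation units.
__host__ void kernelCaller(v *f);
```

Because kernelCaller takes a concrete type rather than a template parameter, it can be compiled once in k.cu and linked from anywhere.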

Upvotes: 3
