Reputation: 1273
I'm attempting to write a reduction function in CUDA (this is an exercise; I know this has been done better by other people) which takes a binary associative operator and an array, and reduces the array using the operator.
I'm having difficulty figuring out how to pass the function. I've written hostOp() as a host-based example, which works fine.
deviceOp() works for the first statement, which calls fminf() explicitly, but when I call the function parameter instead, there is an illegal memory access error.
#include <iostream>
#include <cstdio>
#include <cmath>
using namespace std; //for brevity

__device__ float g_d_a = 9, g_d_b = 5;
float g_h_a = 9, g_h_b = 5;

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = fminf(g_d_a, g_d_b); //works fine
    printf("static function result: %f\n", result);
    result = op(g_d_a, g_d_b); //illegal memory access
    printf("template function result: %f\n", result);
}

template<typename argT, typename funcT>
void hostOp(funcT op){
    argT result = op(g_h_a, g_h_b);
    printf("template function result: %f\n", result);
}

int main(int argc, char* argv[]){
    hostOp<float>(min<float>); //works fine
    deviceOp<float><<<1,1>>>(fminf);
    cudaDeviceSynchronize();
    cout << cudaGetErrorString(cudaGetLastError()) << endl;
}
OUTPUT:
host function result: 5.000000
static function result: 5.000000
an illegal memory access was encountered
Assuming I'm not doing something horribly stupid, how should I be passing fminf to deviceOp so that there isn't an illegal memory access?
If I am doing something horribly stupid, what is a better way?
Upvotes: 0
Views: 544
Reputation: 151849
A function to be called on the device must be decorated with __device__ (or __global__, if you wish it to be a kernel). The nvcc compiler driver will then separate host and device code, and will use the device-compiled version of the function when it is called from (i.e. compiled in) device code, and the host version otherwise.
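For example, here is a minimal sketch (the my_min and callIt names are hypothetical, not from your code) of a function that is legal to call from kernel code precisely because it carries the __device__ decoration:

#include <cstdio>

__device__ float my_min(float a, float b){ // device-compiled, callable from kernels
    return (a < b) ? a : b;
}

__global__ void callIt(){
    printf("%f\n", my_min(9.0f, 5.0f)); // fine: a device version of my_min exists
}

int main(){
    callIt<<<1,1>>>();
    cudaDeviceSynchronize();
}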
This construct is problematic:
deviceOp<float><<<1,1>>>(fminf);
While it may not be obvious, this is essentially all host code. Yes, it is launching a kernel (via an underlying sequence of library calls from host code), but it is technically host code. Therefore the fminf function address "captured" here will be the host version of the fminf function, even though a device version is available (via CUDA math.h, which you are not actually including).
A typical, albeit clumsy, approach to work around this is to "capture" the device address in device code, then pass it as a parameter to your kernel.
You can also short-circuit this process (somewhat) if you are passing function addresses that can be deduced at compile time, with a slightly different templating technique. These concepts are covered in this answer.
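For the compile-time flavor, one possible sketch (not the code from that answer; DeviceMin is a made-up name) is to pass a functor whose __device__ operator() wraps the call, so no function address ever has to cross the host/device boundary:

#include <cstdio>
#include <cmath>

struct DeviceMin {                  // hypothetical functor wrapping fminf
    __device__ float operator()(float a, float b) const {
        return fminf(a, b);         // call target known at compile time
    }
};

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = op(9.0f, 5.0f);
    printf("functor result: %f\n", result);
}

int main(){
    deviceOp<float><<<1,1>>>(DeviceMin()); // pass the functor by value
    cudaDeviceSynchronize();
}

Because the call target is known when the kernel is compiled, the compiler can inline it, and there is no function pointer to get wrong.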
Here is a fully worked example of your code modified using the "capture function address in device code" method:
$ cat t1176.cu
#include <iostream>
#include <cstdio>
#include <cmath>
using namespace std; //for brevity

__device__ float g_d_a = 9, g_d_b = 5;
float g_h_a = 9, g_h_b = 5;

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = fminf(g_d_a, g_d_b); //works fine
    printf("static function result: %f\n", result);
    result = op(g_d_a, g_d_b); //works now: op is a device function pointer
    printf("template function result: %f\n", result);
}

__device__ float (*my_fminf)(float, float) = fminf; // "capture" device function address

template<typename argT, typename funcT>
void hostOp(funcT op){
    argT result = op(g_h_a, g_h_b);
    printf("template function result: %f\n", result);
}

int main(int argc, char* argv[]){
    hostOp<float>(min<float>); //works fine
    float (*h_fminf)(float, float);
    cudaMemcpyFromSymbol(&h_fminf, my_fminf, sizeof(void *));
    deviceOp<float><<<1,1>>>(h_fminf);
    cudaDeviceSynchronize();
    cout << cudaGetErrorString(cudaGetLastError()) << endl;
}
$ nvcc -o t1176 t1176.cu
$ cuda-memcheck ./t1176
========= CUDA-MEMCHECK
template function result: 5.000000
static function result: 5.000000
template function result: 5.000000
no error
========= ERROR SUMMARY: 0 errors
$
Upvotes: 1