Reputation: 1273
I'm attempting to write a reduction function in CUDA (this is an exercise; I know this has been done better by other people) which takes a binary associative operator and an array, and reduces the array using the operator.
I'm having difficulty figuring out how to pass the function. I've written hostOp() as a host-based example, which works fine.
deviceOp() works for the first statement, which calls fminf() explicitly, but when I call the function parameter instead, there is an illegal memory access error.
#include <iostream>
#include <cstdio>
#include <cmath>
using namespace std; //for brevity

__device__ float g_d_a = 9, g_d_b = 5;
float g_h_a = 9, g_h_b = 5;

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = fminf(g_d_a, g_d_b); //works fine
    printf("static function result: %f\n", result);
    result = op(g_d_a, g_d_b); //illegal memory access
    printf("template function result: %f\n", result);
}

template<typename argT, typename funcT>
void hostOp(funcT op){
    argT result = op(g_h_a, g_h_b);
    printf("template function result: %f\n", result);
}

int main(int argc, char* argv[]){
    hostOp<float>(min<float>); //works fine
    deviceOp<float><<<1,1>>>(fminf);
    cudaDeviceSynchronize();
    cout << cudaGetErrorString(cudaGetLastError()) << endl;
}
OUTPUT:
host function result: 5.000000
static function result: 5.000000
an illegal memory access was encountered
Assuming I'm not doing something horribly stupid, how should I be passing fminf to deviceOp so that there isn't an illegal memory access?
If I am doing something horribly stupid, what is a better way?
Upvotes: 0
Views: 544
Reputation: 151849
A function to be called on the device must be decorated with __device__ (or __global__, if you wish it to be a kernel). The nvcc compiler driver will then separate host and device code, and will use the device-compiled version of the function when it is called from (i.e. compiled in) device code, and the host version otherwise.
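For example, here is a minimal sketch (the my_min and callIt names are hypothetical, not from your code) of a function that is legal to call from kernel code precisely because it carries the __device__ decoration:

#include <cstdio>

__device__ float my_min(float a, float b){ // device-compiled, callable from kernels
    return (a < b) ? a : b;
}

__global__ void callIt(){
    printf("%f\n", my_min(9.0f, 5.0f)); // fine: a device version of my_min exists
}

int main(){
    callIt<<<1,1>>>();
    cudaDeviceSynchronize();
}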
This construct is problematic:
deviceOp<float><<<1,1>>>(fminf);
While it may not be obvious, this is essentially all host code. Yes, it is launching a kernel (via an underlying sequence of library calls from host code), but it is technically host code. Therefore the fminf function address "captured" here will be the host version of the fminf function, even though a device version is available (via CUDA math.h, which you are not actually including).
A typical, albeit clumsy, approach to work around this is to "capture" the device address in device code, then pass it as a parameter to your kernel.
You can also short-circuit this process (somewhat) if you are passing function addresses that can be deduced at compile time, with a slightly different templating technique. These concepts are covered in this answer.
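For the compile-time flavor, one possible sketch (not the code from that answer; DeviceMin is a made-up name) is to pass a functor whose __device__ operator() wraps the call, so no function address ever has to cross the host/device boundary:

#include <cstdio>
#include <cmath>

struct DeviceMin {                  // hypothetical functor wrapping fminf
    __device__ float operator()(float a, float b) const {
        return fminf(a, b);         // call target known at compile time
    }
};

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = op(9.0f, 5.0f);
    printf("functor result: %f\n", result);
}

int main(){
    deviceOp<float><<<1,1>>>(DeviceMin()); // pass the functor by value
    cudaDeviceSynchronize();
}

Because the call target is known when the kernel is compiled, the compiler can inline it, and there is no function pointer to get wrong.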
Here is a fully worked example of your code modified using the "capture function address in device code" method:
$ cat t1176.cu
#include <iostream>
#include <cstdio>
#include <cmath>
using namespace std; //for brevity

__device__ float g_d_a = 9, g_d_b = 5;
float g_h_a = 9, g_h_b = 5;

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = fminf(g_d_a, g_d_b); //works fine
    printf("static function result: %f\n", result);
    result = op(g_d_a, g_d_b); //works now: op is a device function pointer
    printf("template function result: %f\n", result);
}

__device__ float (*my_fminf)(float, float) = fminf; // "capture" device function address

template<typename argT, typename funcT>
void hostOp(funcT op){
    argT result = op(g_h_a, g_h_b);
    printf("template function result: %f\n", result);
}

int main(int argc, char* argv[]){
    hostOp<float>(min<float>); //works fine
    float (*h_fminf)(float, float);
    cudaMemcpyFromSymbol(&h_fminf, my_fminf, sizeof(void *));
    deviceOp<float><<<1,1>>>(h_fminf);
    cudaDeviceSynchronize();
    cout << cudaGetErrorString(cudaGetLastError()) << endl;
}
$ nvcc -o t1176 t1176.cu
$ cuda-memcheck ./t1176
========= CUDA-MEMCHECK
template function result: 5.000000
static function result: 5.000000
template function result: 5.000000
no error
========= ERROR SUMMARY: 0 errors
$
Upvotes: 1