Reputation: 131547
I've been pondering the answers to this question:
How to pass device function as an input argument to host-side function?
and especially Robert Crovella's answer. I don't quite understand why the intermediate global symbol is necessary. That is, why does this work:
#include <stdio.h>
__device__ int f1(){ printf("dev f1\n"); return 0;}
__device__ void *fptrf1 = (void*) f1;
__global__ void mykernel(int (*fptr)()) {
    fptr();
    printf("executed\n");
}

int main() {
    void *hf1;
    cudaMemcpyFromSymbol(&hf1, fptrf1, sizeof(int *));
    mykernel<<<1,1>>>((int (*)())hf1);
    cudaDeviceSynchronize();
}
but this doesn't work:
#include <stdio.h>
__device__ int f1(){ printf("dev f1\n"); return 0;}
__global__ void mykernel(int (*fptr)()) {
    fptr();
    printf("executed\n");
}

int main() {
    void *hf1;
    cudaMemcpyFromSymbol(&hf1, f1, sizeof(int *));
    mykernel<<<1,1>>>((int (*)())hf1);
    cudaDeviceSynchronize();
}
? I mean, isn't a function a symbol? And if the global pointer's device-side address can be "known" to my host-side code, why can't the function itself? And if it doesn't work - why does it compile and not complain?
Upvotes: 0
Views: 333
Reputation: 72348
I mean, isn't a function a symbol?
No, it isn't.
I have no special insight here, but no doubt part of the reason for this is historical: when the CUDA APIs were invented, __device__ functions were merely a programming aid. There was no ABI, no function pointer support, and all device functions were inline expanded by the compiler. The only static device symbols which were emitted were __global__ functions, texture references, and __device__ variables. So that sort of usage was neither envisaged nor possible when the language and APIs were put together 15 years ago.
Even with today's post-ABI, ELF-format device toolchain (originally everything was plain text with embedded strings), you won't find __device__ functions exposed by the device object file's ELF interface. It is not possible to retrieve an arbitrary __device__ function via any of the host APIs, unlike __global__ functions and other device symbols.
And if the global pointer's device-side address can be "known" to my host-side code, why can't the function itself?
See above. The APIs have never exposed this.
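One way to see this from the host side (a minimal sketch built from the code in the question, not something from the linked answer): check the return codes of the two cudaMemcpyFromSymbol calls. The copy from the __device__ variable should succeed, while the copy from the __device__ function should fail with an invalid-device-symbol style error, because nothing was ever registered with the runtime for it.
#include <stdio.h>
__device__ int f1(){ printf("dev f1\n"); return 0;}
__device__ void *fptrf1 = (void*) f1;

int main() {
    void *hf1 = nullptr;

    // fptrf1 is a real device symbol, so this copy is expected to succeed
    cudaError_t e1 = cudaMemcpyFromSymbol(&hf1, fptrf1, sizeof(hf1));
    printf("from fptrf1: %s\n", cudaGetErrorString(e1));

    // f1 is not exposed as a device symbol, so this is expected to fail,
    // even though it compiles without complaint
    cudaError_t e2 = cudaMemcpyFromSymbol(&hf1, f1, sizeof(hf1));
    printf("from f1:     %s\n", cudaGetErrorString(e2));
}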
And if it doesn't work - why does it compile and not complain?
Because of the compilation trajectory. The CUDA front end does this to your __device__ function in host code (and there is no discrimination here; it does it to every __device__ function, including the internal toolchain functions and device libraries):
# 3 "unobtainium.cu"
__attribute__((unused)) int f1() {int volatile ___ = 1;::exit(___);}
#if 0
# 3
{ printf("dev f1\n"); return 0; }
#endif
i.e. it creates a dummy host stub so that everything compiles. Kernels and device symbols also get stubs, but with different boilerplate. Those boilerplate stubs match up with the tags which the internal runtime functions use to make the host side runtime API work. But device functions don't, because they are not exposed by the CUDA device code APIs.
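A hedged way to illustrate that difference (my own sketch, not from the original answer): the stub for a __global__ kernel is something the runtime API recognizes, whereas the dummy stub for a __device__ function is not, so querying each with cudaFuncGetAttributes should behave quite differently.
#include <stdio.h>
__device__ int f1(){ printf("dev f1\n"); return 0;}

__global__ void mykernel(int (*fptr)()) {
    fptr();
    printf("executed\n");
}

int main() {
    cudaFuncAttributes attr;

    // the kernel's stub is registered with the runtime, so this should succeed
    cudaError_t e1 = cudaFuncGetAttributes(&attr, mykernel);
    printf("mykernel: %s\n", cudaGetErrorString(e1));

    // the __device__ function only has a dummy stub, so the runtime has
    // nothing registered for it and this is expected to fail
    cudaError_t e2 = cudaFuncGetAttributes(&attr, f1);
    printf("f1:       %s\n", cudaGetErrorString(e2));
}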
And finally your original question:
Why does this work:
#include <stdio.h>
__device__ int f1(){ printf("dev f1\n"); return 0;}
__device__ void *fptrf1 = (void*) f1;
__global__ void mykernel(int (*fptr)()) {
    fptr();
    printf("executed\n");
}

int main() {
    void *hf1;
    cudaMemcpyFromSymbol(&hf1, fptrf1, sizeof(int *));
    mykernel<<<1,1>>>((int (*)())hf1);
    cudaDeviceSynchronize();
}
The interesting thing here is that it didn't always work. Once upon a time you would have had to run a setup kernel to initialize the device-side function pointer. Somewhere around CUDA 5 it started working this way. Why it works is relatively straightforward: a compilation-unit-scope __device__ variable is a valid device symbol, and so is exposed by the host APIs, and the device-side linker can (now) statically assign the correct value during linkage, so that when the runtime initializes, the value is correct. But note that this is static assignment, not anything which happens at runtime.
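For completeness, a sketch of the older workaround mentioned above (my reconstruction of the usual pattern, not code from the original thread; the name setup_fptr is mine): before the device-side linker could do the static assignment, you would have populated the pointer at runtime from a setup kernel, where taking the address of f1 is legal because it happens in device code.
#include <stdio.h>
__device__ int f1(){ printf("dev f1\n"); return 0;}
__device__ void *fptrf1 = nullptr;            // left empty, filled in at runtime

__global__ void setup_fptr() {
    fptrf1 = (void*) f1;                      // taking f1's address is valid in device code
}

__global__ void mykernel(int (*fptr)()) {
    fptr();
    printf("executed\n");
}

int main() {
    setup_fptr<<<1,1>>>();                    // populate the device-side pointer first
    cudaDeviceSynchronize();

    void *hf1;
    cudaMemcpyFromSymbol(&hf1, fptrf1, sizeof(hf1));
    mykernel<<<1,1>>>((int (*)())hf1);
    cudaDeviceSynchronize();
}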
Upvotes: 3