user3413358

Reputation: 11

Online compilation of single CUDA function

I have a function in my program called float valueAt(float3 v). It's supposed to return the value of a function at the given point. The function is user-specified. I have an interpreter for this function at the moment, but others recommended I compile the function online so it's in machine code and is faster.

How do I do this? I believe I know how to load the function when I have PTX generated, but I have no idea how to generate the PTX.

Upvotes: 1

Views: 1316

Answers (2)

Robert Crovella

Reputation: 152279

I've thought about this problem for a while, and while I don't think this is a "great" solution, it does seem to work so I thought I would share it.

The basic idea is to use Linux process spawning to compile the code and then run the compiled result. I think this is pretty much a no-brainer, but since I put the pieces together, I'll post instructions here in case they are useful to somebody else.

The problem in the question is to take a file that contains a user-defined function and evaluate it. Let's assume it is a function of a single variable, y = f(x), and that x and y can be represented as float quantities.

The user would edit a file called fx.txt that contains the desired function. This file must conform to C syntax rules.

fx.txt:

y=1/x

This file then gets included in the __device__ function that will be holding it:

user_testfunc.cuh:

__device__ float fx(float x){
  float y;
#include "fx.txt"
;
  return y;
}

which gets included in the kernel that is called via a wrapper.

cudalib.cu:

#include <math.h>
#include "cudalib.h"
#include "user_testfunc.cuh"

__global__ void my_kernel(float x, float *y){

  *y = fx(x);
}

float cudalib_compute_fx(float x){
  float *d, h_y = 0.0f;
  cudaMalloc(&d, sizeof(float));
  my_kernel<<<1,1>>>(x, d);
  // the blocking copy also waits for the kernel to finish
  cudaMemcpy(&h_y, d, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);  // release the device buffer
  return h_y;
}

cudalib.h:

float cudalib_compute_fx(float x);

The above files get built into a shared library:

nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so

We need a main application to use this shared library.

t452.cu:

#include <stdio.h>
#include <stdlib.h>
#include "cudalib.h"

int main(int argc, char* argv[]){

  if (argc == 1){
    //  recompile lib, and spawn new process
    int retval = system("nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so");
    if (retval != 0) return 1;  // compilation failed, do not spawn the child
    char scmd[128];
    // bounded write: argv[0] may be longer than the buffer
    snprintf(scmd, sizeof(scmd), "%s skip", argv[0]);
    retval = system(scmd);}
  else { // compute f(x) at x = 2.0
    printf("Result is: %f\n", cudalib_compute_fx(2.0));
    }
  return 0;
}

Which is compiled like this:

nvcc -arch=sm_20 -o t452 t452.cu -L. -lmycudalib

At this point, the main application (t452) can be executed and it will produce the result of f(2.0) which is 0.5 in this case:

$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 0.500000

The user can then modify the fx.txt file:

$ vi fx.txt
$ cat fx.txt
y = 5/x

And just re-run the app, and the new functional behavior is used:

$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 2.500000

This method takes advantage of the fact that after a shared library is recompiled and replaced, a new Linux process will pick up the new shared library. Also note that I've omitted several kinds of error checking for clarity. At a minimum I would check CUDA errors. I would also probably delete the shared object (.so) library before recompiling it and then test for its existence after compilation, as a basic check that the compilation succeeded.
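The delete-then-check step described above can be sketched in host code as follows. This is only a sketch under assumptions: the helper name rebuild_library is hypothetical and not part of the files above, and the build command is passed in as a string so that the nvcc line shown earlier can be used unchanged.

```cpp
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper (not part of the original files): delete the old
// library, run the build command, and report success only if the command
// exited with status 0 and the library file exists again afterwards.
bool rebuild_library(const std::string& build_cmd, const std::string& lib_path) {
    std::remove(lib_path.c_str());              // failure is fine: the file may not exist yet
    int status = std::system(build_cmd.c_str());
    if (status != 0) return false;              // the build command reported an error
    std::ifstream probe(lib_path.c_str(), std::ios::binary);
    return probe.good();                        // true only if the file can be opened
}
```

In t452.cu this would stand in for the bare system() call, with the nvcc command line as build_cmd and libmycudalib.so as lib_path, and the child process would be spawned only when it returns true.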

This method uses only the runtime API and invokes nvcc at run time, so the user must have the CUDA toolkit installed and set up so that nvcc is available in the PATH. Using the driver API with PTX code would make this process much cleaner (and would not require the toolkit on the user's machine), but AFAIK there is no way to generate PTX from CUDA C without using nvcc or a user-created toolchain built on the NVIDIA LLVM compiler tools. In the future, there may be a more "integrated" approach available in the "standard" CUDA C toolchain, or perhaps even in the driver.

A similar approach can be arranged using separate compilation and linking of device code, such that the only source code that needs to be exposed to the user is in user_testfunc.cu (and fx.txt).

EDIT: There is now a CUDA runtime compilation facility (NVRTC, part of the toolkit since CUDA 7.0), which should be used in place of the above.
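A runtime compiler takes the CUDA source as a string in memory rather than as files on disk, so the #include "fx.txt" trick would be replaced by building that string in host code. A minimal sketch of this step, assuming a hypothetical helper named make_device_source; the kernel is declared extern "C" so that its name is not mangled in the generated PTX. The compile and load steps themselves are not shown here.

```cpp
#include <string>

// Hypothetical helper (an assumption, not from the answer): wrap a user
// expression such as "y=1/x" in the same fx() body that user_testfunc.cuh
// defines, followed by the one-thread kernel from cudalib.cu.
std::string make_device_source(const std::string& user_expr) {
    std::string src;
    src += "__device__ float fx(float x){\n";
    src += "  float y;\n";
    src += "  " + user_expr + ";\n";   // user text pasted in, as #include "fx.txt" did
    src += "  return y;\n";
    src += "}\n";
    // extern "C" keeps the kernel name unmangled for lookup by name
    src += "extern \"C\" __global__ void my_kernel(float x, float *y){\n";
    src += "  *y = fx(x);\n";
    src += "}\n";
    return src;
}
```

The returned string would then be passed to the runtime compiler in place of cudalib.cu. As with fx.txt, the user expression has to be valid C for the compilation to succeed.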

Upvotes: 2

talonmies

Reputation: 72382

CUDA provides no way of runtime compilation of non-PTX code.

What you want can be done, but not using the standard CUDA APIs. PyCUDA provides an elegant just-in-time compilation method for CUDA C code, which behind the scenes forks the toolchain to compile to device code and then loads the result using the driver API. The (possible) downside is that you need to use Python for the top level of your application, and if you are shipping code to third parties, you might need to ship a working Python distribution too.

The only other alternative I can think of is OpenCL, which does support runtime compilation (that is all it supported until recently). The C99 language base is a lot more restrictive than what CUDA offers, and I find the APIs to be very verbose, but the runtime compilation model works well.

Upvotes: 3
