Reputation: 2398
Well, I have a program which generates, JIT-compiles and runs PTX subprograms on the GPU. The programs run just fine and the run-times are pretty good - around a 500x speedup vs. the CPU. The problem is that JIT compilation takes way too long, erasing all of the GPU speedup and turning it into a slowdown :)
The question is: is there a faster, more efficient way to do this? Can I reuse some resources, or make the process more stream-like?
Edit: Each PTX program is run only once, and they are all very different, so JIT caching provides no benefit.
This is my code, which is pretty much the same as the NVIDIA-provided example JIT app:
CHECK_ERROR(cuLinkCreate(6, linker_options, linker_option_vals, &lState));
// Load the PTX from the string ptxProgram
CHECK_ERROR(cuLinkAddData(lState, CU_JIT_INPUT_PTX, (void*) ptxProgram.c_str(), ptxProgram.size()+1, 0, 0, 0, 0));
// Complete the linker step
CHECK_ERROR(cuLinkComplete(lState, &linker_cuOut, &linker_outSize));
// Linker walltime and info_log were requested in options above.
//printf("CUDA Link Completed in %fms. Linker Output:\n%s\n", linker_walltime, linker_info_log);
// Load resulting cuBin into module
CHECK_ERROR(cuModuleLoadData(&hModule, linker_cuOut));
// Locate the kernel entry point
CHECK_ERROR(cuModuleGetFunction(&hKernel, hModule, "_myBigPTXKernel"));
// Destroy the linker invocation
CHECK_ERROR(cuLinkDestroy(lState));
Upvotes: 0
Views: 2337
Reputation: 172
You may consider JIT caching. http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
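For reference, the linked post describes controlling the JIT cache through environment variables. A sketch, with illustrative values (the default cache size is modest - 32 MB in older CUDA versions - so enlarging it helps if the same PTX recurs across runs):

```shell
# Enlarge the driver's JIT compute cache so compiled cubins are
# reused across runs instead of being recompiled (size in bytes).
export CUDA_CACHE_MAXSIZE=1073741824        # 1 GB
export CUDA_CACHE_PATH=~/.nv/ComputeCache   # default location on Linux
# export CUDA_CACHE_DISABLE=1               # uncomment to turn caching off
```

As the question's edit notes, though, this only pays off when the same PTX is compiled more than once.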
Upvotes: 1