Reputation: 131525
Suppose I take a CUDA program - for example the CUDA vectorAdd
sample, and cut out the kernel's implementation, but still have the launch command:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
and suppose that I write my own PTX since I'm a DIY kind of a guy, so that now I have vectorAdd.cu without the kernel's CUDA code, and vectorAdd.ptx.
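For concreteness, the remaining vectorAdd.cu would look roughly like this (a sketch; the float parameter types are those of the CUDA vectorAdd sample):
// vectorAdd.cu with the kernel body cut out - only the declaration remains,
// so the triple-chevron launch below still compiles:
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements);
// ... host code unchanged ...
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);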
Can I now generate an executable which will work like the unmodified vectorAdd would, but running the code in the PTX?
(Assume for the sake of discussion that the PTX doesn't try anything funny or do anything wrong.)
Notes:
This question is a variant on:
How can I create an executable to run a kernel in a given PTX file?
Except that, in that question, the poster was willing to use the driver API to dynamically load and compile a PTX file. Here, that's not an option: the C++ code uses a triple-chevron CUDA runtime launch, and this must not change.
I don't mind the process of creating the executable involving the generation of other files, e.g. a cubin.
Upvotes: 1
Views: 920
Reputation: 1974
EDIT/Update
I had to change some of the procedures for a newer toolkit; see below for the old description.
Some details seem to have changed with newer toolkit versions, and I have added a bit more documentation along the way.
Tested on Windows with VS 2022 (MSVC 19.31.31104), an sm_75 Turing GPU and nvcc 11.7, doing a debug build. It should also work on Linux and/or with other host compilers.
Steps 1 and 2
We can start with full kernels and patch the PTX or SASS, or we can create stub kernels.
Using the example project from the VS CUDA plugin here, the main() and addWithCuda() helper function for the host and an addKernel() global device kernel are all contained in a single file kernel.cu.
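A stub kernel only needs to keep the exact name and signature of the real kernel, so that the generated host-side registration still matches. A minimal sketch (assuming the addKernel signature from the VS template):
__global__ void addKernel(int *c, const int *a, const int *b)
{
    // intentionally left empty - the real implementation is supplied later
    // by the patched PTX/SASS
}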
Step 3
Copying the command line from Visual Studio, removing the paths and adding the --keep option; we also remove -x cu and the output file option -o kernel.cu.obj so that less has to be changed later on:
nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -G --keep -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd" kernel.cu
This command should be run on the command line with x64 paths for the Visual Studio compiler.
The following files are generated:
kernel.cpp1.ii (raw code after the preprocessor, with different preprocessor macros)
kernel.cpp1.ii.res (command-line parameters for the preprocessor)
kernel.cpp4.ii (raw code after the preprocessor, with different preprocessor macros)
kernel.cpp4.ii.res (command-line parameters for the preprocessor)
kernel.cu.obj (compiled code from C++ and Cuda)
kernel.cu.obj.res (command line parameters)
kernel.cudafe1.c (runtime stub, includes kernel.cudafe1.stub.c)
kernel.cudafe1.cpp (C++ code after the preprocessor; the CUDA code is kept only as declarations (otherwise #if 0), and the launches go through __cudaPushCallConfiguration)
kernel.cudafe1.gpu (Cuda code after preprocessor, C++ code removed)
kernel.cudafe1.stub.c (runtime stub to register and launch the kernel)
kernel.fatbin (fatbin of the kernel)
kernel.fatbin.c (dump of kernel.fatbin as C array)
kernel.module_id (ID identifying the kernel)
kernel.ptx (The ptx from the original .cu)
kernel.sm_75.cubin (The compiled ptx file from the original .cu)
vc143.pdb (Visual Studio program database - host stuff)
Important to keep are kernel.cudafe1.cpp and kernel.cudafe1.stub.c, which enable the registering and runtime launch of the kernel.
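Roughly, those two files work together like this (a simplified sketch, not the literal generated code; __cudaPushCallConfiguration and __cudaPopCallConfiguration are undocumented toolkit internals):
// kernel.cudafe1.cpp: the triple-chevron launch is rewritten to push the
// launch configuration and then call the host-side stub, roughly
//   (__cudaPushCallConfiguration(grid, block, shared, stream)) ? (void)0
//       : addKernel(c, a, b);
// kernel.cudafe1.stub.c: the stub pops that configuration again and forwards
// everything to the runtime; the stub is also what gets registered against
// the kernel's entry in the fatbin at program startup.
void addKernel(int *c, const int *a, const int *b)  // host-side stub (sketch)
{
    dim3 grid, block;
    size_t shared;
    cudaStream_t stream;
    __cudaPopCallConfiguration(&grid, &block, &shared, &stream);
    void *args[] = { &c, &a, &b };
    cudaLaunchKernel((const void *)addKernel, grid, block, args, shared, stream);
}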
Step 4
The kernel.ptx can be used as a skeleton for recreating the PTX code, or the kernel.sm_75.cubin or directly the kernel.fatbin.c for patching the SASS code.
Step 5
Now we assemble the .ptx file with -fatbin -dlink and change the input file from kernel.cu to kernel.ptx:
nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -G --keep -maxrregcount=0 --machine 64 -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd " -fatbin -dlink kernel.ptx
We get:
a_dlink.fatbin
a_dlink.sm_75.cubin
a_dlink.sm_75.cubin.optf
kernel.fatbin
kernel.fatbin.c (fatbin dump from new ptx)
kernel.obj
kernel.obj.res
kernel.sm_75.cubin
vc143.pdb
Note: Instead of a .ptx file, a .cubin file can also be turned into a .fatbin.c with -fatbin -dlink, but then I had to specify a non-virtual compute architecture, e.g. with --gpu-architecture=sm_75.
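E.g. the short form for the cubin route could look like this (an assumption, keeping only the options relevant here):
nvcc -fatbin -dlink --gpu-architecture=sm_75 kernel.sm_75.cubin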
Now we have the host files together: kernel.fatbin.c, kernel.cudafe1.cpp and kernel.cudafe1.stub.c.
Steps 6 to 8
We can put those through host compilation:
nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -G --keep -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd" kernel.cudafe1.cpp input\kernel.fatbin.c
Step 9
and then through the host linker:
link /OUT:"kernel.exe" /MANIFEST /NXCOMPAT /DYNAMICBASE "cudart_static.lib" "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" "cudart.lib" "cudadevrt.lib" /DEBUG /MACHINE:X64 /INCREMENTAL /SUBSYSTEM:CONSOLE /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /ERRORREPORT:PROMPT /NOLOGO /LIBPATH:"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\x64" /TLBID:1 kernel.cudafe1.obj kernel.fatbin.obj
Finished
This creates kernel.exe.
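To verify that the patched device code really ended up in the executable, cuobjdump can dump it back out (assuming the CUDA bin directory is on the PATH):
cuobjdump -ptx kernel.exe
cuobjdump -sass kernel.exe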
Old Description
Perhaps still relevant with other toolkit versions.
__global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements); in a header visible to the caller
__global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements) {} as an empty stub definition
nvcc --keep vectorAdd.cu with suitable options
nvcc -fatbin -dlink to create the fatbin and cubin files
(As an advanced DIY guy you will want to write SASS code in the future, which is the device-specific, lower-level assembly language.)
Upvotes: 1