Reputation: 159
I want my compiled CUDA code to work on any Nvidia GPU, so I compile each .cu file with the options:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_32,code=sm_32
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_53,code=sm_53
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_61,code=sm_61
-gencode arch=compute_61,code=compute_61
(This is using CUDA 8.0 so I don't have the newer architectures listed yet.)
The issue is that nvcc compiles each of these targets synchronously, which can take quite a long time. Is there a way to split this up across multiple CPU cores? I'm using a Make build system.
I can manually make the .ptx or .cubin file for each architecture in a different async nvcc invocation easily using a different Make target for each architecture. However how do I combine these into a final .o file to be linked together with my host code?
This: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory Seems to imply I should take multiple .cubin files and combine them into a .fatbin file. However when I try to do that I get the error:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
Is this possible? What am I missing? Thanks!
Edit 1: Following talonmies reply. I've tried to do:
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include" -DWIN32 --device-c -cubin -gencode arch=compute_30,code=sm_30 -o ms_30.cubin ms.cu
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include" -DWIN32 --device-c -cubin -gencode arch=compute_35,code=sm_35 -o ms_35.cubin ms.cu
And then link with:
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -o out.o -dlink ms_35.cubin ms_30.cubin -I"F:/SDKs/CUDASDK/9.2/include"
However I get the error:
fatbinary fatal : fatbinary elf mismatch: elf arch '35' does not match '30'
All the examples using device link always just have one arch used. Is it possible to combine architectures this way?
Upvotes: 2
Views: 1814
Reputation: 7245
nvcc
is merely a front-end issuing commands to a number of other tools.
If you add the --dryrun
flag to your nvcc
invocation, it will print the exact commands you need to run to replace your use of nvcc
.
From there it should be easy to convert this list of commands into a script or makefile.
Update: nvcc
from CUDA 11.3 finally supports this out of the box via the -t
flag.
Upvotes: 4
Reputation: 72349
The tool chain doesn't support this and you shouldn't expect to be able to do this by hand as nvcc does either.
However, you can certainly script some sort process to
You will probably need to enable separate device code compilation and you might also need to refactor your code slightly as a result. Caveat Emptor and all that.
Upvotes: 1