mb13

Reputation: 159

Parallel compilation of multiple CUDA architectures on the same .cu file

I want my compiled CUDA code to work on any Nvidia GPU, so I compile each .cu file with the options:

-gencode arch=compute_20,code=sm_20
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_32,code=sm_32
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_53,code=sm_53
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_61,code=sm_61
-gencode arch=compute_61,code=compute_61

(This is using CUDA 8.0 so I don't have the newer architectures listed yet.)

The issue is that nvcc compiles these targets sequentially, one after another, which can take quite a long time. Is there a way to split this work across multiple CPU cores? I'm using a Make build system.

I can easily build the .ptx or .cubin file for each architecture in a separate, concurrent nvcc invocation by using a different Make target per architecture. However, how do I then combine these into a final .o file that can be linked with my host code?

The nvcc documentation (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory) seems to imply that I should take multiple .cubin files and combine them into a .fatbin file. However, when I try to do that, I get the error:

nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified

Is this possible? What am I missing? Thanks!

Edit 1: Following talonmies' reply, I tried compiling one cubin per architecture:

F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc  -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include"  -DWIN32 --device-c -cubin -gencode arch=compute_30,code=sm_30 -o ms_30.cubin ms.cu
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc  -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include"  -DWIN32 --device-c -cubin -gencode arch=compute_35,code=sm_35 -o ms_35.cubin ms.cu

And then link with:

F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -o out.o -dlink ms_35.cubin ms_30.cubin -I"F:/SDKs/CUDASDK/9.2/include"

However I get the error:

fatbinary fatal   : fatbinary elf mismatch: elf arch '35' does not match '30'

All the device-link examples I can find use only a single architecture. Is it possible to combine architectures this way?

Upvotes: 2

Views: 1814

Answers (2)

tera

Reputation: 7245

nvcc is merely a front-end issuing commands to a number of other tools. If you add the --dryrun flag to your nvcc invocation, it will print the exact commands you need to run to replace your use of nvcc.

From there it should be easy to convert this list of commands into a script or makefile.
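As a rough sketch of that conversion, the snippet below assumes that `nvcc --dryrun` writes each command to stderr prefixed with `#$ ` (the usual format, but worth checking on your CUDA version). The helper name `strip_dryrun_prefix` and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn `nvcc --dryrun` output into a list of plain commands.
# Assumption: every command line is prefixed with "#$ "; other lines are dropped.
strip_dryrun_prefix() {
    sed -n 's/^#\$ //p'
}

# Usage (not executed here): the dry-run listing goes to stderr, so redirect it.
#   nvcc --dryrun -gencode arch=compute_30,code=sm_30 \
#        -gencode arch=compute_35,code=sm_35 -c ms.cu -o ms.o 2>&1 \
#       | strip_dryrun_prefix > build_steps.sh
```

The listing typically also contains environment variable assignments, so review the result before running it. The per-architecture steps in it are the ones that could become separate Make targets.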

Update: nvcc from CUDA 11.3 finally supports this out of the box via the -t (--threads) flag.
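For illustration, a command line using that flag might be assembled as below. This is only a sketch: the architecture list and the file names `ms.cu` / `ms.o` are examples, and `gencode_flags` is a hypothetical helper. The script only prints the command, so it can be inspected before use.

```shell
#!/bin/sh
# Hypothetical helper: emit one -gencode flag per architecture number given.
gencode_flags() {
    local arch
    for arch in "$@"; do
        printf -- '-gencode arch=compute_%s,code=sm_%s ' "$arch" "$arch"
    done
}

# --threads 0 asks nvcc to use as many threads as the machine has CPUs.
# The command is printed rather than executed.
echo "nvcc --threads 0 $(gencode_flags 50 52 60 61)-c ms.cu -o ms.o"
```

Passing a positive number instead of 0 caps the number of threads, which may be preferable when Make is already running several jobs in parallel.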

Upvotes: 4

talonmies

Reputation: 72349

The toolchain doesn't support this, and you shouldn't expect to be able to reproduce by hand what nvcc does internally either.

However, you can certainly script some sort of process to:

  1. Execute parallel compilation of the code to multiple cubin files, one for each target architecture
  2. Perform a device link pass to combine the cubins to a single elf payload
  3. Link the final executable with the resulting object file emitted by the device link phase
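Step 1 alone could be sketched in a shell script along these lines, using background jobs and `wait`. The `NVCC` variable, the `compile_cubins` function, the source name, and the architecture list are all placeholders; the nvcc flags mirror those in the question's edit. Steps 2 and 3 are not covered here.

```shell
#!/bin/sh
# Sketch of step 1 only: one nvcc invocation per architecture, run concurrently.
# NVCC may be overridden from the environment; it defaults to "nvcc" on PATH.
NVCC="${NVCC:-nvcc}"

# compile_cubins SRC ARCH... -> writes <base>_<arch>.cubin for each ARCH.
compile_cubins() {
    local src="$1"
    shift
    local base="${src%.cu}"
    local arch pid pids="" status=0
    for arch in "$@"; do
        "$NVCC" --device-c -cubin \
            -gencode "arch=compute_${arch},code=sm_${arch}" \
            -o "${base}_${arch}.cubin" "$src" &
        pids="$pids $!"
    done
    # Wait on each job individually so a failure for any architecture is seen.
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Usage (not executed here): compile_cubins ms.cu 30 35
```

With Make, the same effect can be had by giving each cubin its own target and running `make -j`, which also avoids rebuilding architectures that are already up to date.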

You will probably need to enable separate device code compilation, and you might also need to refactor your code slightly as a result. Caveat emptor and all that.

Upvotes: 1
