Reputation: 6185
I have a Numba CUDA kernel which I can launch with up to 640 threads and 64 blocks on an RTX 3090.
If I attempt to use 641 threads, it fails with:
Traceback (most recent call last):
  File "/home/stark/Work/mmr6/mmr/algos/company_analysis/_analysis_gpu_backup.py", line 905, in <module>
    load()
  File "/home/stark/Work/mmr6/mmr/algos/company_analysis/_analysis_gpu_backup.py", line 803, in load_markets
    run_simulations[algo_configs.BLOCK_COUNT, algo_configs.THREAD_COUNT, stream](
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/compiler.py", line 821, in __call__
    return self.dispatcher.call(args, self.griddim, self.blockdim,
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/compiler.py", line 966, in call
    kernel.launch(args, griddim, blockdim, stream, sharedmem)
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/compiler.py", line 693, in launch
    driver.launch_kernel(cufunc.handle,
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 2094, in launch_kernel
    driver.cuLaunchKernel(cufunc_handle,
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 300, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 335, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [701] Call to cuLaunchKernel results in CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
But when I look at nvidia-smi, I see it takes just 2.9GB of memory to run with 640 threads. This GPU has 22GB unused.
What else could be the issue in such situations? I read somewhere that grid size, block size, register usage, and shared memory usage are considerations. How can I find out how many registers and how much shared memory I am using?
Upvotes: 2
Views: 780
Reputation: 151973
It's usually a registers-per-thread issue (CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES). This is covered in many questions here on the SO cuda tag, such as this one. There are many others as well, such as here. In short, the total number of registers used per threadblock cannot exceed the limit for your GPU (see below). The total registers used per threadblock is approximately the number of registers per thread times the threads per block (potentially rounded up for allocation granularity).
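If you want to check those limits from Python rather than running deviceQuery, recent Numba versions let you read the driver's device attributes directly on the device object. A minimal sketch, assuming the attribute names mirror the driver's CU_DEVICE_ATTRIBUTE_* constants with the prefix stripped (which is how recent Numba maps them; older versions may not expose all of these):

from numba import cuda

# Register limits for the current GPU, queried via the driver's
# device attributes (names mirror CU_DEVICE_ATTRIBUTE_*).
dev = cuda.get_current_device()
print("Max registers per block:", dev.MAX_REGISTERS_PER_BLOCK)
print("Max registers per SM:   ", dev.MAX_REGISTERS_PER_MULTIPROCESSOR)
print("Max threads per block:  ", dev.MAX_THREADS_PER_BLOCK)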
The principal method to address this issue in numba cuda is to include a maximum register usage parameter in your cuda.jit decorator:
@cuda.jit(max_registers=40)
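For illustration, here is a hypothetical toy kernel using that parameter, which also shows one way to answer the "how many registers and how much shared memory am I using" part of the question: recent Numba releases expose get_regs_per_thread() on the dispatcher (and, in newer releases still, get_shared_mem_per_block()). Treat this as a sketch against a recent Numba, not a guaranteed API on the version in your traceback:

import numpy as np
from numba import cuda

@cuda.jit(max_registers=40)
def scale(arr, factor):   # toy kernel, for illustration only
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

d_arr = cuda.to_device(np.ones(1024, dtype=np.float32))
scale[4, 256](d_arr, 2.0)  # launch once so the kernel is compiled

# On an unspecialized dispatcher these return a mapping from signature
# to value; with a single compiled signature there is one entry.
print(scale.get_regs_per_thread())
print(scale.get_shared_mem_per_block())  # only in newer Numba releases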
You can of course set max_registers to other values. A simple heuristic is to divide the total number of registers per SM, or per threadblock if that is lower (discoverable with the CUDA deviceQuery sample code or in table 15 of the programming guide), by the total number of threads per block you wish to launch. So if your GPU SM has 64K registers and you want to launch 1024 threads per block, you would choose a maximum of 64 registers per thread. That number should work for the RTX 3090.
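Expressed as code, the heuristic looks like this (a sketch, reusing the assumed attribute names from above):

from numba import cuda

def suggested_max_registers(threads_per_block):
    # The register pool available to one threadblock: the per-SM pool,
    # or the per-block limit if that is lower.
    dev = cuda.get_current_device()
    regs = min(dev.MAX_REGISTERS_PER_MULTIPROCESSOR,
               dev.MAX_REGISTERS_PER_BLOCK)
    return regs // threads_per_block

# On an RTX 3090 (compute capability 8.6) both limits are 64K, so
# 65536 // 1024 == 64 registers per thread for 1024-thread blocks.
print(suggested_max_registers(1024))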
Upvotes: 5