Reputation: 6185
I have a Numba CUDA kernel which I can launch with up to 640 threads and 64 blocks on an RTX 3090.
If I attempt to use 641 threads, it fails with:
Traceback (most recent call last):
  File "/home/stark/Work/mmr6/mmr/algos/company_analysis/_analysis_gpu_backup.py", line 905, in <module>
    load()
  File "/home/stark/Work/mmr6/mmr/algos/company_analysis/_analysis_gpu_backup.py", line 803, in load_markets
    run_simulations[algo_configs.BLOCK_COUNT, algo_configs.THREAD_COUNT, stream](
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/compiler.py", line 821, in __call__
    return self.dispatcher.call(args, self.griddim, self.blockdim,
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/compiler.py", line 966, in call
    kernel.launch(args, griddim, blockdim, stream, sharedmem)
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/compiler.py", line 693, in launch
    driver.launch_kernel(cufunc.handle,
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 2094, in launch_kernel
    driver.cuLaunchKernel(cufunc_handle,
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 300, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/stark/anaconda3/envs/mmr-env/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 335, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [701] Call to cuLaunchKernel results in CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
But when I look at nvidia-smi, I see it takes just 2.9GB of memory to run with 640 threads. This GPU has 22GB unused.
What else could be the issue in such situations? I read somewhere that grid size, block size, register usage, and shared memory usage are considerations. How can I find out how many registers and how much shared memory I am using?
Upvotes: 2
Views: 780
Reputation: 151973
It's usually a registers-per-thread issue (CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES). This is covered in many questions here on the SO cuda tag, such as this one. There are many others as well, such as here. In short, the total number of registers used per threadblock cannot exceed the limit for your GPU (see below). The total registers used per threadblock is approximately the number of registers per thread times the threads per block (potentially rounded up for allocation granularity).
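If you want to check those limits from Python rather than running deviceQuery, recent Numba versions let you read the driver's device attributes directly on the device object. A minimal sketch, assuming the attribute names mirror the driver's CU_DEVICE_ATTRIBUTE_* constants with the prefix stripped (which is how recent Numba maps them; older versions may not expose all of these):

from numba import cuda

# Register limits for the current GPU, queried via the driver's
# device attributes (names mirror CU_DEVICE_ATTRIBUTE_*).
dev = cuda.get_current_device()
print("Max registers per block:", dev.MAX_REGISTERS_PER_BLOCK)
print("Max registers per SM:   ", dev.MAX_REGISTERS_PER_MULTIPROCESSOR)
print("Max threads per block:  ", dev.MAX_THREADS_PER_BLOCK)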
The principal method to address this issue in numba cuda is to include a maximum register usage parameter in your cuda.jit decorator:
@cuda.jit(max_registers=40)
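For illustration, here is a hypothetical toy kernel using that parameter, which also shows one way to answer the "how many registers and how much shared memory am I using" part of the question: recent Numba releases expose get_regs_per_thread() on the dispatcher (and, in newer releases still, get_shared_mem_per_block()). Treat this as a sketch against a recent Numba, not a guaranteed API on the version in your traceback:

import numpy as np
from numba import cuda

@cuda.jit(max_registers=40)
def scale(arr, factor):   # toy kernel, for illustration only
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

d_arr = cuda.to_device(np.ones(1024, dtype=np.float32))
scale[4, 256](d_arr, 2.0)  # launch once so the kernel is compiled

# On an unspecialized dispatcher these return a mapping from signature
# to value; with a single compiled signature there is one entry.
print(scale.get_regs_per_thread())
print(scale.get_shared_mem_per_block())  # only in newer Numba releases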
You can of course set max_registers to other values. A simple heuristic is to divide the total number of registers per SM, or per threadblock if that is lower (discoverable with the CUDA deviceQuery sample code or in table 15 of the programming guide), by the total number of threads per block you wish to launch. So if your GPU SM has 64K registers and you want to launch 1024 threads per block, you would choose a maximum of 64 registers per thread. That number should work for the RTX 3090.
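Expressed as code, the heuristic looks like this (a sketch, reusing the assumed attribute names from above):

from numba import cuda

def suggested_max_registers(threads_per_block):
    # The register pool available to one threadblock: the per-SM pool,
    # or the per-block limit if that is lower.
    dev = cuda.get_current_device()
    regs = min(dev.MAX_REGISTERS_PER_MULTIPROCESSOR,
               dev.MAX_REGISTERS_PER_BLOCK)
    return regs // threads_per_block

# On an RTX 3090 (compute capability 8.6) both limits are 64K, so
# 65536 // 1024 == 64 registers per thread for 1024-thread blocks.
print(suggested_max_registers(1024))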
Upvotes: 5