Reputation: 766
I am trying to configure the following nested thread architecture.
 |      |      |        <- main threads running
 |      |      |
|||    |||    |||       <- each launches nested threads
|vv    |vv    |vv       <- nested threads complete
 v      v      v        <- main threads continue
Each main thread continues only after its nested threads have completed.
The problem is that in larger structures I may run into starvation, because the nested threads synchronize through what are currently custom locks: a while loop spinning on a standard mutex flag. This won't be an issue until the program launches more threads than the GPU can actually run simultaneously. Is there a way to swap between active threads based on the mutex logic?
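For reference, the custom lock is currently just this kind of spin loop on an atomic flag (simplified):

    __device__ int mutex = 0;                    // 0 = free, 1 = held

    __device__ void acquire(int *m)
    {
        // busy-wait: threads that lose the race keep spinning,
        // occupying their SM slot the whole time
        while (atomicCAS(m, 0, 1) != 0) { }
    }

    __device__ void release(int *m)
    {
        __threadfence();                         // publish protected writes first
        atomicExch(m, 0);
    }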
Upvotes: 0
Views: 171
Reputation: 151879
The link you have given covers CUDA dynamic parallelism (CDP).
In the non-CDP case, if you intend to use mutexes/locks, it is the programmer's responsibility to make sure that all necessary threads can make forward progress. There is no way to "swap" between active threads. Once a thread is made active by the GPU scheduler, it must eventually be able to make forward progress. It will consume a scheduler slot (a slot on the SM) until it does. You cannot change this.
There is an exception in the CDP case, which applies to the relationship between parent kernel and child kernels (only). A parent kernel is allowed to launch a child kernel, and the GPU thread scheduler will, if necessary, "swap" out parent kernel threads so that child kernel threads can make forward progress, and eventually satisfy the implicit or explicit synchronization in the parent thread that is dependent on completion of child grids.
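A minimal sketch of that parent/child relationship (hypothetical kernels; CDP requires compute capability 3.5+ and compilation with relocatable device code, e.g. nvcc -rdc=true):

    #include <cuda_runtime.h>

    __global__ void child(int *data)
    {
        data[threadIdx.x] += 1;                  // trivial per-thread work
    }

    __global__ void parent(int *data)
    {
        if (threadIdx.x == 0)
            child<<<1, 32>>>(data);              // device-side launch
        // No explicit device-side sync needed here: the child grid is
        // guaranteed to complete before the parent grid is considered
        // complete, and the scheduler may swap parent threads out of
        // their SM slots to let the child run.
    }

    int main()
    {
        int *data;
        cudaMalloc(&data, 32 * sizeof(int));
        cudaMemset(data, 0, 32 * sizeof(int));
        parent<<<1, 32>>>(data);
        cudaDeviceSynchronize();                 // waits for parent and child
        cudaFree(data);
        return 0;
    }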
However, this exception for the CDP parent/child case does not change the general rule: within a grid, whether parent or child, it is the programmer's responsibility to use locks or mutexes intelligently so that the grid can make the necessary forward progress, without expecting that the CUDA runtime will swap out threads that have been assigned an active slot on an SM. One safe pattern is sketched below.
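One common safe pattern (a sketch, not code from the question): have a single thread per block contend for a global lock, so that threads in the same warp never spin on a lock held by a warp-mate, which is a classic intra-warp deadlock source:

    __device__ int lock = 0;                     // 0 = free, 1 = held

    __global__ void blockSerialized(int *counter)
    {
        // only thread 0 of each block contends for the lock
        if (threadIdx.x == 0)
            while (atomicCAS(&lock, 0, 1) != 0) { }
        __syncthreads();

        // ... work that must be serialized across blocks goes here;
        // the atomicAdd below is just a stand-in ...
        if (threadIdx.x == 0)
            atomicAdd(counter, 1);

        __syncthreads();
        if (threadIdx.x == 0) {
            __threadfence();                     // publish writes before release
            atomicExch(&lock, 0);                // release
        }
    }

This works even with more blocks launched than can be resident: blocks that have not yet been scheduled do not hold the lock, and the lock circulates among resident blocks, which then retire and free their slots.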
There is also no way to explicitly force threads to be swapped in and out of active SM slots. The implicit mechanisms are the CDP behavior already discussed and CUDA stream priorities, but neither one guarantees that threads within a particular grid will be swapped.
(Regarding stream priorities: in the current implementation, I don't believe it will swap out threads or threadblocks that are already scheduled; they run until they complete. It is actually an opportunistic scheduling control, not a preemptive one, which schedules threadblocks from higher-priority streams when the opportunity -- available scheduling slots on an SM -- presents itself. However, AFAIK, there is nothing in the CUDA execution model that explicitly prevents stream priorities from swapping out active threadblocks, so it's possible the behavior could change in the future.)
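For completeness, the stream priority mechanism looks like this on the host side (a minimal sketch; work is a hypothetical kernel):

    #include <cuda_runtime.h>

    __global__ void work(float *x) { x[threadIdx.x] *= 2.0f; }

    int main()
    {
        int leastPri, greatestPri;
        // query the legal priority range for this device
        cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);

        cudaStream_t hiPri;
        // numerically lower value = higher priority
        cudaStreamCreateWithPriority(&hiPri, cudaStreamNonBlocking, greatestPri);

        float *x;
        cudaMalloc(&x, 32 * sizeof(float));
        // threadblocks from this launch are preferred when SM slots free
        // up, but already-resident blocks from other streams keep running
        work<<<1, 32, 0, hiPri>>>(x);

        cudaStreamSynchronize(hiPri);
        cudaStreamDestroy(hiPri);
        cudaFree(x);
        return 0;
    }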
Upvotes: 2