NothingMore

Reputation: 1281

What is the behavior of thread block scheduling to specific SMs after CUDA kernel launch?

My question is about the scheduling of thread blocks in CUDA (specifically on Kepler or newer NVIDIA architectures) after execution of a kernel has already begun.

From my understanding of the Kepler architecture (which may be incorrect), there is a limit to the number of active blocks that can be scheduled on a single SM at any moment in time (16 blocks, if I remember correctly). Also, from my understanding, blocks cannot move once they have been scheduled to run on a specific SM.

What I am curious about is the block scheduling and execution behavior after the initial selection of blocks takes place and those blocks have begun executing on the device (assuming the kernel has more thread blocks than can be active across all SMs at once).

Are new blocks started as soon as a single currently active block completes on an SM? Or is the next set of blocks started only after an SM completes all of its currently active blocks? Or are they started only after all SMs complete all of their currently active blocks?

In addition, I have heard that block scheduling is "fixed" to a single SM. I am assuming it is fixed to a single SM only after the block becomes active. Is this the case?
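(For reference, the per-SM resident-block limit for a particular kernel can be queried at runtime with the occupancy API. A minimal sketch, assuming CUDA 6.5 or newer; the kernel `dummy` and the block size of 128 are placeholders:)

```cuda
#include <stdio.h>

__global__ void dummy(){ }

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("SMs on device 0: %d\n", prop.multiProcessorCount);
  // How many blocks of this kernel (at 128 threads/block, 0 bytes of
  // dynamic shared memory) could be resident on one SM, given its
  // register, shared memory, and thread usage?
  int max_blocks = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, dummy, 128, 0);
  printf("max resident blocks of dummy per SM: %d\n", max_blocks);
  return 0;
}
```

The result is capped by the architectural per-SM block limit (16 on Kepler) but may be lower for kernels that use more resources.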

Upvotes: 4

Views: 2284

Answers (1)

Robert Crovella

Reputation: 151799

New blocks can be scheduled as soon as an SM has sufficient unused resources to support the new block. It is not necessary for the SM to be completely drained of blocks before new blocks can be scheduled.

As pointed out in the comments, I'm not sure I can point to public documentation that supports this assertion. However, it's possible to create a test case and prove it to yourself.

In a nutshell, you would create a block-specialized kernel that would launch many blocks. The first block on each SM would discover and declare itself using atomics. These blocks would "persist" until all other blocks had completed, using a block-completed counter (again, using atomics, similar to the threadfence reduction sample code). All other blocks that are not the first to launch on a given SM would simply exit. The completion of such a code, as opposed to it hanging, would be the proof that other blocks can be scheduled even if some blocks are still resident.

Here is a fully worked example:

$ cat t743.cu
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define NB 1000
// increase array length here if your GPU has more than 32 SMs
#define MAX_SM 32
// set HANG_TEST to 1 to demonstrate a hang for test purposes
#define HANG_TEST 0

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

static __device__ __inline__ uint32_t __smid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ volatile int blocks_completed = 0;
// increase array length here if your GPU has more than 32 SMs
__device__ int first_SM[MAX_SM];

// launch with one thread per block only
__global__ void tkernel(int num_blocks, int num_SMs){

  int my_SM = __smid();
  int im_not_first = atomicCAS(first_SM+my_SM, 0, 1);
  if (!im_not_first){
    while (blocks_completed < (num_blocks-num_SMs+HANG_TEST));
  }
  atomicAdd((int *)&blocks_completed, 1);
}

int main(int argc, char *argv[]){
  unsigned my_dev = 0;
  if (argc > 1) my_dev = atoi(argv[1]);
  cudaSetDevice(my_dev);
  cudaCheckErrors("invalid CUDA device");
  int tot_SM = 0;
  cudaDeviceGetAttribute(&tot_SM, cudaDevAttrMultiProcessorCount, my_dev);
  cudaCheckErrors("CUDA error");
  if (tot_SM > MAX_SM) {printf("program configuration error\n"); return 1;}
  printf("running on device %d, with %d SMs\n", my_dev, tot_SM);
  int temp[MAX_SM];
  for (int i = 0; i < MAX_SM; i++) temp[i] = 0;
  cudaMemcpyToSymbol(first_SM, temp, MAX_SM*sizeof(int));
  cudaCheckErrors("cudaMemcpyToSymbol fail");
  tkernel<<<NB, 1>>>(NB, tot_SM);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel error");
}

$ nvcc -o t743 t743.cu
$ ./t743 0
running on device 0, with 15 SMs
$ ./t743 1
running on device 1, with 1 SMs
$ ./t743 2

I have tested the above code on Linux with CUDA 7, on a K40c, C2075, and Quadro NVS 310 GPU. It doesn't hang.

To answer your second question, a block generally remains on the SM on which it was first scheduled. One possible exception is in the case of CUDA dynamic parallelism.
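(You can also check residency directly. A minimal sketch of my own, not from the answer above: record the SM id at block entry and again at block exit, then compare on the host. The kernel name `check` and the grid/block sizes are arbitrary choices.)

```cuda
#include <stdio.h>
#include <stdint.h>

static __device__ __inline__ uint32_t my_smid(){
    uint32_t id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void check(uint32_t *start_sm, uint32_t *end_sm){
  // one thread per block records the SM id at entry...
  if (threadIdx.x == 0) start_sm[blockIdx.x] = my_smid();
  // ... arbitrary work would go here ...
  __syncthreads();
  // ...and again at exit; the two should always match
  if (threadIdx.x == 0) end_sm[blockIdx.x] = my_smid();
}

int main(){
  const int nb = 256;
  uint32_t *s, *e;
  cudaMallocManaged(&s, nb*sizeof(uint32_t));
  cudaMallocManaged(&e, nb*sizeof(uint32_t));
  check<<<nb, 128>>>(s, e);
  cudaDeviceSynchronize();
  int moved = 0;
  for (int i = 0; i < nb; i++) if (s[i] != e[i]) moved++;
  printf("blocks reporting a different SM at exit: %d\n", moved);
  return 0;
}
```

On every GPU I am aware of, the count printed is 0 for an ordinary (non-CDP) kernel.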

Upvotes: 9
