user20995197

CUDA C: host doesn't send all the threads at once

I'm trying to learn CUDA, so I wrote some silly code (see below) in order to understand how CUDA works. I set the number of blocks to 1024, but when I run my program it seems that the host doesn't send all 1024 threads to the GPU at once. Instead, the GPU processes about 350 threads first, then another 350, and so on. Why? Thanks in advance!

PS1: My computer has Ubuntu installed and an NVIDIA GeForce GTX 1660 SUPER

PS2: In my program, each block just sleeps (busy-waits) for a few seconds and does nothing else. The host also creates an array called "H_Arr" and copies it to the GPU, although the device never uses this array. Of course, the latter doesn't make much sense, but I'm just experimenting to understand how CUDA works.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <limits>
#include <iostream>
#include <fstream>
#include <unistd.h>

using namespace std;
__device__ void funcDev(int tid);

int    NB=1024;
int    NT=1;

__global__
void funcGlob(int NB, int NT){
   // NT is the number of threads per block, so this is the usual global thread index
   int tid=blockIdx.x*NT+threadIdx.x;

   #if __CUDA_ARCH__ >= 200
      printf("start block %d \n",tid);
      if(tid<NB*NT){
         funcDev(tid);
      }

      printf("end block  %d\n",tid);
   #endif

}

__device__ void funcDev(int tid){
   // busy-wait for roughly clock_count GPU clock cycles;
   // clock64() is used because the 32-bit counter behind clock() wraps
   // long before 10^10 cycles have elapsed
   long long int clock_count = 10000000000LL;

   long long int start_clock  = clock64();
   long long int clock_offset = 0;
   while (clock_offset < clock_count)
   {
      clock_offset = clock64() - start_clock;
   }
}
int main(void)
{
   int i;
   ushort *D_Arr,*H_Arr;   
   H_Arr   = new ushort[NB*NT+1];
   
   for(i=0;i<NB*NT+1;i++){H_Arr[i]=1;}   // fill all NB*NT+1 elements

   cudaMalloc((void**)&D_Arr,(NB*NT+1)*sizeof(ushort));
   cudaMemcpy(D_Arr,H_Arr,(NB*NT+1)*sizeof(ushort),cudaMemcpyHostToDevice);
   
   funcGlob<<<NB,NT>>>(NB,NT);
   cudaDeviceSynchronize();   // wait for the kernel to finish so its printf output is flushed

   cudaFree(D_Arr);
   delete [] H_Arr;

   return 0;
}

I wrote a program in CUDA C and set the number of blocks to 1024. If I understood correctly, in theory all 1024 blocks should run simultaneously. However, this is not what happens.

Upvotes: 0

Views: 52

Answers (1)

Robert Crovella

Reputation: 151879

GTX 1660 Super seems to have 22 SMs.

It is a compute capability 7.5 device. If you run the deviceQuery CUDA sample code on your GPU, you can confirm that (the compute capability and the number of SMs, called "Multiprocessors"), and also discover that your GPU has a limit of 16 resident blocks per SM at any moment.
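
As a rough sketch (assuming CUDA 11 or newer, where cudaDevAttrMaxBlocksPerMultiprocessor is available), the same numbers can also be read straight from the runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);               // device 0

   int maxBlocksPerSM = 0;
   // per-SM limit on resident blocks (16 on a compute capability 7.5 device)
   cudaDeviceGetAttribute(&maxBlocksPerSM,
                          cudaDevAttrMaxBlocksPerMultiprocessor, 0);

   printf("compute capability : %d.%d\n", prop.major, prop.minor);
   printf("multiprocessors    : %d\n", prop.multiProcessorCount);
   printf("max blocks per SM  : %d\n", maxBlocksPerSM);
   printf("full wave of blocks: %d\n", maxBlocksPerSM * prop.multiProcessorCount);
   return 0;
}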

So I haven't studied your code at all, really, but since you are launching 1024 blocks (of one thread each), it would be my expectation that the block scheduler would deposit an initial wave of 16x22=352 blocks on the SMs, and then it would wait for some of those blocks to finish/retire before it would be able to deposit any more.

So an "initial wave" of 352 scheduled blocks sounds just right to me.

Throughout your posting, you refer primarily to threads. While it might be correct to say "350 threads are running" (since you are launching one thread per block), it isn't very instructive to think of it that way. The GPU work distributor schedules blocks, not individual threads, and the relevant limit here is the blocks-per-SM limit.
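
To make that concrete, here is a small sketch (a made-up indexDemo kernel, not your code): it launches 4 blocks of 256 threads each, so 1024 threads are running, but the work distributor only has 4 blocks to place on the SMs:

#include <stdio.h>

// each thread combines its block index and its thread index within the block
// to form a unique global index
__global__ void indexDemo(void)
{
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   if (threadIdx.x == 0)
      printf("block %d starts (first global index %d)\n", blockIdx.x, tid);
}

int main(void)
{
   indexDemo<<<4,256>>>();     // 4 schedulable blocks, 1024 threads total
   cudaDeviceSynchronize();    // wait for the kernel so its printf output appears
   return 0;
}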

If you don't really understand the distinction between blocks and threads, or other basic CUDA concepts, you can find many questions here on the SO cuda tag about these concepts, and this online training series, particularly the first 3-4 units, will also be illuminating.

Upvotes: 3
