Performance hit in CUDA program that calls kernel repeatedly within a for loop

Question

I have a CUDA program that calls the kernel repeatedly within a for loop. The code computes all rows of a matrix by using the values computed in the previous one until the entire matrix is done. This is basically a dynamic programming algorithm. The code below fills the (i,j) entry of many separate matrices in parallel with the kernel.

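for (i = 1; i <= xdim; i++) {
  for (j = 1; j <= ydim; j++) {
    start3time = clock();
    // The launch configuration was lost from the post; blocks and threads are placeholders.
    assign5<<<blocks, threads>>>(Z, i, j, x, y, z);
    end3time = clock();
    diff = static_cast<double>(end3time - start3time) / (CLOCKS_PER_SEC / 1000);
    printf("Time for i=%d j=%d is %f\n", i, j, diff);
  }
}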

The kernel assign5 is straightforward

__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {

  int id = threadIdx.x + blockIdx.x * blockDim.x;

  char ch = database[j + id];

  Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id])*dev_matrix[i][index[ch - 'A']];

}

My problem is that when I run this program the time for each i and j is 0 most of the time but sometimes it is 10 milliseconds. So the output looks like

Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.

I don't understand why this is happening. I don't see a thread race condition. If I add

if(i % 20 == 0) cudaThreadSynchronize();

right after the first loop, then the time for each i and j is mostly 0, but the time for the sync itself is sometimes 10 or even 20 ms. It seems like CUDA performs many operations at low cost and then charges a lot for later ones. Any help would be appreciated.

shapecatcher · Accepted Answer

I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel call is non-blocking: it only adds the kernel to the device's queue and returns. If you read the clock before and after the kernel call, the difference says nothing about how long the kernel runs; it only measures the time it takes to add the call to the queue.

You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks and returns only once all kernels in the queue have finished their work.
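A minimal sketch of the difference, assuming a made-up kernel and problem size: busy_kernel, time_launch_ms, and n are hypothetical names, not part of the asker's program. It uses cudaDeviceSynchronize(), the current name for the now-deprecated cudaThreadSynchronize(). The same launch is timed once without synchronizing (enqueue cost only) and once with synchronization (enqueue plus execution).

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (not the asker's assign5): does some
// arithmetic per element so that execution takes measurable time.
__global__ void busy_kernel(float* data, int n) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n) {
        float v = data[id];
        for (int k = 0; k < 10000; ++k) v = v * 1.0001f + 0.5f;
        data[id] = v;
    }
}

// Host wall-clock milliseconds around one launch.
// sync == false: measures only the time to enqueue the launch.
// sync == true:  also waits until the kernel has finished.
double time_launch_ms(float* d_data, int n, bool sync) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<blocks, threads>>>(d_data, n);
    if (sync) cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 20;  // assumed problem size, for illustration only
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    time_launch_ms(d_data, n, true);                    // warm-up launch
    double enqueue_ms = time_launch_ms(d_data, n, false);
    cudaDeviceSynchronize();                            // drain the queue
    double full_ms = time_launch_ms(d_data, n, true);

    std::printf("launch only:   %f ms\n", enqueue_ms);
    std::printf("launch + sync: %f ms\n", full_ms);

    cudaFree(d_data);
    return 0;
}
```

On most setups the first number should be far smaller than the second, although the exact values depend on the GPU and driver.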

This is why

if(i % 20 == 0) cudaThreadSynchronize();

made spikes in your measurements: that call has to wait for every kernel queued since the previous synchronization.
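The spike can be reproduced in isolation. The sketch below is an illustration under assumptions, not the asker's code: busy_kernel, BatchTiming, time_batch, and the sizes are hypothetical. It queues several launches without synchronizing, then times the enqueue loop and the following synchronize call separately on the host. A pair of CUDA events additionally reports how long the batch took on the device.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel, used only to keep the GPU busy.
__global__ void busy_kernel(float* data, int n) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n) {
        float v = data[id];
        for (int k = 0; k < 10000; ++k) v = v * 1.0001f + 0.5f;
        data[id] = v;
    }
}

struct BatchTiming {
    double enqueue_ms;  // host time spent issuing all launches
    double sync_ms;     // host time spent inside the synchronize call
    float gpu_ms;       // device time between the two recorded events
};

BatchTiming time_batch(float* d_data, int n, int launches) {
    using host_clock = std::chrono::steady_clock;
    using millis = std::chrono::duration<double, std::milli>;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto t0 = host_clock::now();
    cudaEventRecord(start);                      // queued before the batch
    for (int k = 0; k < launches; ++k)
        busy_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);                       // queued after the batch
    auto t1 = host_clock::now();
    cudaDeviceSynchronize();                     // waits for all queued work
    auto t2 = host_clock::now();

    BatchTiming t{};
    t.enqueue_ms = millis(t1 - t0).count();
    t.sync_ms = millis(t2 - t1).count();
    cudaEventElapsedTime(&t.gpu_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return t;
}

int main() {
    const int n = 1 << 20;  // assumed problem size, for illustration only
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    time_batch(d_data, n, 1);                    // warm-up
    BatchTiming t = time_batch(d_data, n, 20);
    std::printf("enqueue 20 launches: %f ms\n", t.enqueue_ms);
    std::printf("synchronize:         %f ms\n", t.sync_ms);
    std::printf("GPU time (events):   %f ms\n", t.gpu_ms);

    cudaFree(d_data);
    return 0;
}
```

If the explanation above holds, the synchronize time should be close to the event-measured GPU time, while the enqueue loop stays comparatively cheap.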
