Hair of Slytherin

Reputation: 406

cudaDeviceSynchronize and performing sequential work

I have a program that, when profiled with nvprof, shows that ~98% of the execution time is devoted to cudaDeviceSynchronize. In thinking about how to optimize the following code, I'm brought back here to try to confirm my understanding of the need for cudaDeviceSynchronize.

The general layout of my program is this:

Copy input array to GPU.
program<<<1,1>>>(inputs)
Copy outputs back to host.

Thus, my program kernel is a master thread that basically looks like this:

for (int i = 0; i < 10000; i++)
{
    calcKs(inputs);
    takeStep(inputs);
}

The calcKs function is one of the most egregious abusers of cudaDeviceSynchronize and looks like this:

//Calculate k1's
//Calc fluxes for r = 1->(ml-1), then for r = 0, then calc K's
zeroTemps();
calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2
cudaDeviceSynchronize();
calcK<<< numBlocks, numThreads >>>(k1s, temp2);             //k1s calculated from temp2
cudaDeviceSynchronize(); 

where the arrays temp2, temp1, and k1s are each calculated from one another's results. My understanding was that cudaDeviceSynchronize was essential because I need temp2 to be completely calculated before temp1 is calculated, and the same for temp1 and k1s.

I feel like I've critically misunderstood the function of cudaDeviceSynchronize after reading this post: When to call cudaDeviceSynchronize?. I'm not sure how pertinent the comments there are to my situation, however, as all of my program is running on the device and there's no CPU-GPU interaction until the final memory copy back to the host, hence I don't get the implicit serialization caused by the memcpy.

Upvotes: 0

Views: 1973

Answers (1)

Robert Crovella

Reputation: 151799

CUDA activities (kernel calls, memcopies, etc.) issued to the same stream will be serialized.

When you don't use streams at all in your application, everything you are doing is in the default stream.
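
As a minimal illustration of that ordering guarantee, here is a self-contained sketch (the kernels and sizes are made up, not taken from your code): the second kernel consumes the first kernel's output correctly with no explicit synchronization at all.

//Self-contained sketch with invented kernels: back-to-back launches in the
//default stream execute in issue order, so kernelB can safely read what
//kernelA wrote without any cudaDeviceSynchronize()
#include <cstdio>

__global__ void kernelA(int *data)               { data[threadIdx.x] = threadIdx.x; }
__global__ void kernelB(const int *in, int *out) { out[threadIdx.x] = 2 * in[threadIdx.x]; }

int main()
{
    const int n = 32;
    int *d_a, *d_b, h_b[n];
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));

    kernelA<<<1, n>>>(d_a);         //default stream
    kernelB<<<1, n>>>(d_a, d_b);    //default stream: does not start until kernelA finishes

    //Also issued to the default stream, and blocking on the host: by the time
    //it returns, both kernels are guaranteed to have completed
    cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_b[5] = %d\n", h_b[5]);    //prints 10
    return 0;
}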

Therefore, in your case, there is no functional difference between:

calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2

and:

calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2

You don't show what calcMonomerFlux is, but assuming it uses data from temp2 and is doing calculations on the host, it must be using cudaMemcpy to grab the temp2 data before it actually uses it. Since the cudaMemcpy will be issued to the same stream as the preceding kernel call (calcFlux), it will be serialized, i.e. it will not begin until calcFlux is done. Your other code depending on temp2 data in calcMonomerFlux presumably executes after the cudaMemcpy, which is a blocking operation, so it will not begin executing until the cudaMemcpy is done.
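
For concreteness, here is one hypothetical shape such a calcMonomerFlux could take under that assumption (the array length ml, the double element type, and the someHostCalc helper are all invented for illustration; none of this is from your code):

//Hypothetical sketch of calcMonomerFlux, assuming host-side calculation on temp2 data
//(ml, double elements, and someHostCalc are invented for illustration)
void calcMonomerFlux(double *temp2, double *temp1)
{
    double *h_temp2 = new double[ml];
    double *h_temp1 = new double[ml];

    //Issued to the default stream: will not start until calcFlux has finished,
    //and blocks the host thread until the copy itself completes
    cudaMemcpy(h_temp2, temp2, ml * sizeof(double), cudaMemcpyDeviceToHost);

    for (int r = 0; r < ml; r++)
        h_temp1[r] = someHostCalc(h_temp2[r]);      //host-side work on the temp2 data

    //Also default stream and blocking on the host, so the result is correct
    //without any cudaDeviceSynchronize()
    cudaMemcpy(temp1, h_temp1, ml * sizeof(double), cudaMemcpyHostToDevice);

    delete[] h_temp2;
    delete[] h_temp1;
}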

Even if calcMonomerFlux contains kernels that operate on temp2 data, the argument is the same. Those kernels are presumably issued to the same stream (default stream) as calcFlux, and therefore will not begin until calcFlux is complete.

So the cudaDeviceSynchronize() call is almost certainly not needed.
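
In other words, the sequence in your calcKs should produce the same results with the synchronization calls simply removed, because the default stream already enforces the ordering you need:

zeroTemps();
calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2
calcK<<< numBlocks, numThreads >>>(k1s, temp2);             //k1s calculated from temp2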

Having said that, cudaDeviceSynchronize() by itself should not consume a tremendous amount of overhead. The reason most of your execution time is being attributed to cudaDeviceSynchronize() is that, from the host thread's perspective, this sequence:

calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
cudaDeviceSynchronize();

spends almost all its time in the cudaDeviceSynchronize() call. The kernel call is asynchronous, meaning it launches the kernel and then immediately returns control to the host thread, allowing the host thread to continue. Therefore the overhead in the host thread for a kernel call may be as low as a few microseconds. But the cudaDeviceSynchronize() call will block the host thread until the preceding kernel call completes. The longer your kernel executes, the more time the host thread spends waiting at the cudaDeviceSynchronize() call. So nearly all your host thread execution time appears to be spent on these calls.
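
To see where the time actually goes, here is a small self-contained timing sketch (the kernel and sizes are invented for illustration): the launch itself returns in microseconds, while roughly the kernel's entire run time shows up as waiting inside cudaDeviceSynchronize().

//Illustration with a hypothetical kernel: host-side timing shows the launch
//returning almost immediately, while the host thread spends roughly the
//kernel's whole execution time inside cudaDeviceSynchronize()
#include <chrono>
#include <cstdio>

__global__ void slowKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100000; k++)
            x[i] = x[i] * 0.999f + 0.001f;      //busy work so the kernel runs a while
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();
    slowKernel<<<(n + 255) / 256, 256>>>(d_x, n);   //asynchronous: returns right away
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();                        //host thread blocks here until the kernel finishes
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("launch overhead:               %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("wait in cudaDeviceSynchronize: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}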

For properly written single threaded, single (default) stream CUDA codes, cudaDeviceSynchronize() is almost never needed in the host thread. It may be useful in some cases for certain types of debugging/error checking, and it may be useful in the case where you have a kernel executing and want to see the printout (printf) from the kernel before your application terminates.
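
For example, a common debug-time pattern (shown here against your calcFlux launch, purely as an illustration) is to pair cudaGetLastError() with cudaDeviceSynchronize() so that asynchronous kernel errors surface at a known point:

//Debug-time error checking around a kernel launch
calcFlux<<< numBlocks, numThreads >>>(concs, temp2);
cudaError_t err = cudaGetLastError();       //catches launch/configuration errors
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();              //catches errors that occur while the kernel runs
if (err != cudaSuccess)
    printf("kernel execution error: %s\n", cudaGetErrorString(err));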

Upvotes: 4
