zaphodxvii

Reputation: 23

CUDA error checking on cudaDeviceSynchronize after kernel launch may not catch every error?

I recently found a comment on the accepted answer by @talonmies stating the following:

Note that, unlike all other CUDA errors, kernel launch errors will not be reported by subsequent synchronizing calls to the CUDA runtime API. Just putting gpuErrchk() around the next cudaMemcpy() or cudaDeviceSynchronize() call is thus insufficient to catch all possible error conditions. I'd argue it is better style to call cudaGetLastError() instead of cudaPeekAtLastError() immediately after a kernel launch even though they have the same effect, to aid the unwitting reader.

My question is, how is it possible that cudaGetLastError may catch an error that would not be detected by cudaDeviceSynchronize? Shouldn't any error that hasn't been cleared be returned by cudaDeviceSynchronize?

I always do error checking around API calls, and after a kernel launch I call cudaDeviceSynchronize (since my kernels take far longer than the data transfers, so there is no significant performance loss), and I thought I was safe this way. In what scenarios could this approach fail?

Upvotes: 2

Views: 1189

Answers (1)

Robert Crovella

Reputation: 151982

A description of asynchronous (sticky) vs. synchronous (non-sticky) errors is covered here. Furthermore, I cover this exact topic in some detail in unit 12 of this online training series.

Please familiarize yourself with those. I'm not going to repeat all of that material here.

My question is, how is it possible that cudaGetLastError may catch an error that would not be detected by cudaDeviceSynchronize? Shouldn't any error that hasn't been cleared be returned by cudaDeviceSynchronize?

No.

Most CUDA runtime API calls, including such examples as cudaMemcpy(), cudaMalloc(), cudaStreamCreate(), cudaDeviceSynchronize() and many others, will return errors that fit the following descriptions:

  1. Any previously occurring asynchronous error. Such errors occur during the execution of device code; they corrupt the CUDA context and cannot be cleared except by destroying the underlying CUDA context.

  2. Any synchronous error that occurs as the result of the runtime call itself, only.

That means if I call cudaMemcpy(), it will report any async errors per item 1, plus any synchronous errors that occur as a result of that cudaMemcpy() call itself, but not errors from any other call. Likewise for cudaDeviceSynchronize().
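To illustrate item 1, here is a minimal sketch (the kernel and its null-pointer write are my own illustrative example, not from the answer): a device-side illegal memory access is an asynchronous, sticky error, so the next synchronizing runtime call does report it, and it keeps being reported afterward.

```cuda
#include <cstdio>

// Illustrative kernel: writing through a null pointer triggers an
// asynchronous (sticky) error during device code execution.
__global__ void badKernel(int *p) { *p = 42; }

int main() {
    badKernel<<<1, 1>>>(nullptr);   // the launch itself is legal, so no sync error here

    // The async error surfaces in the next synchronizing runtime call (item 1):
    cudaError_t err = cudaDeviceSynchronize();
    printf("first sync:  %s\n", cudaGetErrorString(err));

    // Sticky: the context is corrupted, so subsequent calls keep returning it.
    err = cudaDeviceSynchronize();
    printf("second sync: %s\n", cudaGetErrorString(err));
}
```

On a typical setup both lines report something like "an illegal memory access was encountered", which is exactly the behavior described in item 1.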

So what is missed?

A synchronous error as a result of a kernel call. For example:

mykernel<<<1,1025>>>(...);

We immediately know that such a launch cannot proceed, because 1025 threads per block is illegal in CUDA (currently). An error of that type does not occur as a result of device code execution; it arises from inspection of the kernel launch request. It is a synchronous error, not an asynchronous one.

If you do this:

__global__ void mykernel(){}

int main(){
  mykernel<<<1,1025>>>();
  cudaError_t err = cudaDeviceSynchronize();
}

the err variable will contain cudaSuccess (more accurately, the enum token that corresponds to cudaSuccess, and likewise for all other such references in this answer). On the other hand if you do this:

int main(){
  mykernel<<<1,1025>>>();
  cudaError_t err = cudaGetLastError();
  cudaDeviceSynchronize();
}

the err variable will contain something like cudaErrorInvalidConfiguration (enum token 9, see here).

You might wish to try this experiment yourself.
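A combined version of the experiment (a sketch assembled from the two snippets above; the printed strings are what I would expect from cudaGetErrorString, but verify on your own setup):

```cuda
#include <cstdio>

__global__ void mykernel() {}

int main() {
    mykernel<<<1, 1025>>>();   // 1025 threads per block: synchronous launch error

    cudaError_t launchErr = cudaGetLastError();      // catches the launch error
    cudaError_t syncErr   = cudaDeviceSynchronize(); // does not

    printf("cudaGetLastError:      %s\n", cudaGetErrorString(launchErr));
    printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(syncErr));
}
```

The first line should report something like "invalid configuration argument" (cudaErrorInvalidConfiguration), while the second reports "no error", demonstrating that the synchronous launch error is invisible to cudaDeviceSynchronize().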

Anyway, that answer has been carefully crafted by an expert. I personally wouldn't discount any part of it.

Yes, the behavior of cudaGetLastError() (and cudaPeekAtLastError()) differs from the error reporting of most other CUDA runtime API calls that I described in items 1 and 2 above. They will (among other things) report a synchronous error from a previously occurring runtime API call (or kernel launch) that has not yet been cleared.
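Putting it all together, the usual recommendation is to check both: cudaGetLastError() right after the launch for synchronous launch errors, and a checked cudaDeviceSynchronize() for asynchronous device-side errors. A sketch of that pattern, using a gpuErrchk-style macro like the one referenced in the question (the macro names here are illustrative):

```cuda
#include <cstdio>
#include <cstdlib>

// A typical checking macro (a common pattern; names are illustrative):
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit((int)code);
    }
}

__global__ void mykernel() {}

int main() {
    mykernel<<<1, 1025>>>();
    gpuErrchk(cudaGetLastError());       // catches synchronous launch errors
    gpuErrchk(cudaDeviceSynchronize());  // catches asynchronous device-side errors
}
```

With this ordering, the illegal launch configuration above is caught by the cudaGetLastError() check, which cudaDeviceSynchronize() alone would have missed.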

Upvotes: 4
