pQB

Reputation: 3137

How to trap a bug in CUDA that resets the machine

I've implemented a kernel that computes distances among vectors. The program runs as expected and the results are the same as on the CPU. The program frees the resources used on the device (cudaFree) and exits normally. In addition, before exiting I call cudaDeviceReset().

All the CUDA API calls are wrapped to check for errors, as in the Eclipse Nsight API example. No errors are reported during execution of the program.
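For reference, a minimal sketch of such a wrapper (the CUDA_CHECK name and the abort-on-error policy are illustrative, in the spirit of the check macro used by the Nsight-generated examples, not necessarily my exact code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err__));   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```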

The kernel checks the memory indexes before performing any read or write access to global memory, i.e., if (idx < N) ...
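A hypothetical sketch of what such a guarded kernel looks like (the kernel name, signature and distance formula here are placeholders, not my actual code):

```cuda
// Each thread computes one distance; the bounds check keeps threads
// beyond the problem size from touching global memory.
__global__ void pairwiseDistance(const float *a, const float *b,
                                 float *dist, int n, int dim)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {                        // guard before any global access
        float acc = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = a[idx * dim + d] - b[idx * dim + d];
            acc += diff * diff;
        }
        dist[idx] = sqrtf(acc);
    }
}
```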

On the CPU side, a loop executes p times, performing a cudaMalloc and a cudaMemcpy (host to device) before calling the kernel, and a cudaFree() before the next iteration. A cudaDeviceSynchronize() is placed after the kernel launch and before the cudaFree call to wait for the launched GPU work to complete.
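The host loop has roughly the following shape (a sketch that reuses the CUDA_CHECK wrapper and the kernel outlined above; buffer names, sizes and launch configuration are placeholders):

```cuda
for (int i = 0; i < p; ++i) {
    float *d_in = NULL, *d_out = NULL;
    CUDA_CHECK(cudaMalloc(&d_in,  n * dim * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_in, h_in, n * dim * sizeof(float),
                          cudaMemcpyHostToDevice));

    pairwiseDistance<<<(n + 255) / 256, 256>>>(d_in, d_in, d_out, n, dim);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());   // wait for the launched work

    CUDA_CHECK(cudaMemcpy(h_out, d_out, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
}
```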

cuda-memcheck does not report any errors when analyzing the program, in either Release or Debug mode.

However, the computer sometimes restarts while the program is running, and I have not found any repeating pattern that would let me track the error down. So, my question is: how can I trap this error?

I'm using CUDA release 5.0, V0.2.1221, on Ubuntu x86_64 GNU/Linux with the X server running. The device is a GTX 480 and the installed driver version is 304.54.

Upvotes: 0

Views: 302

Answers (1)

pQB

Reputation: 3137

It turned out to be a problem related to the device temperature.

Following the comment of @Robert Crovella, I executed the kernel on a dedicated x86_64 GNU/Linux server (no X server running), also with CUDA 5 but with a GTX 680. The program ran successfully every time.

I traced the GPU memory usage and the temperature with the nvidia-smi command and found that my computer resets when the temperature exceeds 70 °C.
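nvidia-smi reads these values through NVIDIA's NVML library, so the same counters can also be polled programmatically. A minimal C sketch, assuming the NVML headers are installed (compile with -lnvidia-ml); device index 0 and the one-second interval are arbitrary choices:

```c
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

// Poll GPU 0 once per second and print temperature and memory usage,
// comparable to running `nvidia-smi -q -d TEMPERATURE,MEMORY -l 1`.
int main(void)
{
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (;;) {
        unsigned int temp = 0;
        nvmlMemory_t mem;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetMemoryInfo(dev, &mem);
        printf("temp: %u C, memory used: %llu MiB\n",
               temp, (unsigned long long)(mem.used >> 20));
        sleep(1);
    }
}
```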

So, the problem is not related to a memory leak or a memory access violation, but to the intensive use of the device.

Upvotes: 1
