Reputation: 21
I am debugging an MPI-based CUDA program with DDT. My code aborts when the CUDA runtime library (libcudart) throws an exception in the (undocumented) function cudaGetExportTable, called from cudaMalloc and cudaThreadSynchronize (UPDATED: using cudaDeviceSynchronize gives the same error) in my code.
Why is libcudart throwing an exception (I am using the C API, not the C++ API) before I can detect it in my code via its cudaError_t return value or with CHECKCUDAERROR?
(I'm using the CUDA 4.2 SDK for Linux.)
Output:
Process 9: terminate called after throwing an instance of 'cudaError_enum'
Process 9: terminate called recursively
Process 20: terminate called after throwing an instance of 'cudaError'
Process 20: terminate called recursively
My code:
cudaThreadSynchronize();
CHECKCUDAERROR("cudaThreadSynchronize()");
Other code fragment:
const size_t t;  // from argument to function
void* p = NULL;
const cudaError_t r = cudaMalloc(&p, t);
if (r != cudaSuccess) {
    ERROR("cudaMalloc failed.");
}
Partial Backtrace:
Process 9:
cudaDeviceSynchronize()
-> cudaGetExportTable()
-> __cxa_throw
Process 20:
cudaMalloc()
-> cudaGetExportTable()
-> cudaGetExportTable()
-> __cxa_throw
Memory debugging errors:
Processes 0,2,4,6-9,15-17,20-21:
Memory error detected in Malloc_cuda_gx (cudamalloc.cu:35):
dmalloc bad admin structure list.
Line 35 of cudamalloc.cu is the cudaMalloc call in the code fragment shown above. Also:
Processes 1,3,5,10-11,13-14,18-19,23:
Memory error detected in vfprintf from /lib64/libc.so.6:
dmalloc bad admin structure list.
Also, when running with 3 GPUs per node instead of 4, dmalloc detects similar memory errors; yet when not in debug mode, the code runs perfectly fine with 3 GPUs per node (as far as I can tell).
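(For context: each MPI rank uses one GPU. The sketch below only illustrates the kind of rank-to-GPU mapping involved; the modulo scheme and the names are assumptions, not my actual device-selection code.)
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative sketch only (call after MPI_Init): map each MPI rank to one
// of the node's GPUs. The modulo mapping is an assumption, not my real code.
static int select_device_for_rank(void)
{
    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);            // 3 or 4 GPUs per node in my runs
    const int dev = rank % ndev;
    if (cudaSetDevice(dev) != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaSetDevice(%d) failed\n", rank, dev);
        return -1;
    }
    return dev;
}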
Upvotes: 0
Views: 1007
Reputation: 21
Recompile with gcc. (I was using icc to compile my code.)
When I do this, the exception still appears while debugging, but continuing past it I get real CUDA errors:
Process 9: gadget_cuda_gx.cu:116: ERROR in gadget_cuda_gx.cu:919: CUDA ERROR: cudaThreadSynchronize(): unspecified launch failure
Process 20: cudamalloc.cu:38: ERROR all CUDA-capable devices are busy or unavailable, cudaMalloc failed to allocate 856792 bytes = 0.817101 Mb
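(Aside: "unspecified launch failure" is an asynchronous error from an earlier kernel launch that only surfaces at the next synchronization point. A minimal sketch of checking right after a launch, with a placeholder kernel and launch configuration that are not taken from my code:)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(void) { }   // placeholder, not from my code

// Minimal sketch: surface an asynchronous launch error immediately after
// the launch instead of at a later cudaThreadSynchronize().
static int launch_and_check(void)
{
    dummy_kernel<<<1, 1>>>();
    cudaError_t e = cudaGetLastError();      // launch/configuration errors
    if (e == cudaSuccess)
        e = cudaDeviceSynchronize();         // errors during kernel execution
    if (e != cudaSuccess) {
        fprintf(stderr, "CUDA ERROR: %s\n", cudaGetErrorString(e));
        return 1;
    }
    return 0;
}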
Valgrind reveals no memory corruption or leaks in my code (whether compiled with gcc or icc), but it does find a few leaks inside libcudart.
UPDATE: Still not fixed. This appears to be the same problem reported in answer #2 to this thread: cudaMemset fails on __device__ variable. The runtime isn't working as it should, it seems...
Upvotes: 1