Reputation: 33
I am using CUDA driver API. The simplified problem description is as follows:
// .cu file, compile to ptx file.
extern "C" __global__ void SomeFunction(char* d_buffer) {
float* p = malloc(sizeof(float) * 100); // Allocate memory per thread
do some calculation with allocated memory. // About 5x10^5 threads.
do some other calculation with d_buffer.
free(p)
}
// .cpp file
int main()
{ // Allocate device buffer
CUdeviceptr d_buffer;
cuMemAlloc(&d_buffer, bytes);
// Allocate host buffer
char* h_buffer = new char(bytes);
// copy host buffer to device buffer
cuMemcpyHtoD(h_buffer, d_buffer, bytes);
CUfunction func;
cuModuleGetFunction(&func, module, "SomeFunction");
cuLaunchKernel(func, grid_dims,...,block_dims,...,args,...);
// copy device buffer to host buffer
cuMemcpyDtoH(d_buffer, h_buffer, bytes); // Failed!
}
The problem is the copy operation in the last line of the .cpp file FAILED. However, if I commented out the dynamic allocation (malloc, free) in .cu file, the copy operation will SUCCESS. My question is that is there any restriction using dynamic allocation in driver API? If so, what are those? How can I use the dynamic allocation correctly in driver API?
Upvotes: 0
Views: 229
Reputation: 72349
My question is that is there any restriction using dynamic allocation in driver API?
No more than in the runtime API.
How can I use the dynamic allocation correctly in driver API?
The important thing to realize is that the copy after the kernel is failing because the kernel itself is failing with errors at runtime.
As described in the programming guide, runtime kernel allocations come from a fixed sized heap, which defaults to 8Mb. If you exhaust that heap, malloc
calls in the kernel will fail, and the call will return NULL
. This is a condition you can test for. I guess you do not, and then your "do some calculation with allocated memory" dereferences a null pointer and blows up.
To rectify this in the driver API, you will need to call cuCtxSetLimit with the CU_LIMIT_MALLOC_HEAP_SIZE
parameter and set that heap size to something more realistic (think the maximum number of resident threads on your device x number of bytes per thread rounded up to the nearest 16 byte page alignment, plus a safety margin). If you do that things will probably start working.
Upvotes: 2