Charlie
Charlie

Reputation: 33

Dynamic allocation in device makes the memory copy fails

I am using CUDA driver API. The simplified problem description is as follows:

// .cu file, compile to ptx file.

extern "C" __global__ void SomeFunction(char* d_buffer) {
    float* p = malloc(sizeof(float) * 100); // Allocate memory per thread
    do some calculation with allocated memory. // About 5x10^5 threads.
    do some other calculation with d_buffer.
    free(p)
}

// .cpp file

int main()
{   // Allocate device buffer
    CUdeviceptr d_buffer;
    cuMemAlloc(&d_buffer, bytes);
    // Allocate host buffer 
    char* h_buffer = new char(bytes); 
    // copy host buffer to device buffer 
    cuMemcpyHtoD(h_buffer, d_buffer, bytes);

    CUfunction func;
    cuModuleGetFunction(&func, module, "SomeFunction");
    cuLaunchKernel(func, grid_dims,...,block_dims,...,args,...);
    // copy device buffer to host buffer 
    cuMemcpyDtoH(d_buffer, h_buffer, bytes); // Failed! 
}

The problem is the copy operation in the last line of the .cpp file FAILED. However, if I commented out the dynamic allocation (malloc, free) in .cu file, the copy operation will SUCCESS. My question is that is there any restriction using dynamic allocation in driver API? If so, what are those? How can I use the dynamic allocation correctly in driver API?

Upvotes: 0

Views: 229

Answers (1)

talonmies
talonmies

Reputation: 72349

My question is that is there any restriction using dynamic allocation in driver API?

No more than in the runtime API.

How can I use the dynamic allocation correctly in driver API?

The important thing to realize is that the copy after the kernel is failing because the kernel itself is failing with errors at runtime.

As described in the programming guide, runtime kernel allocations come from a fixed sized heap, which defaults to 8Mb. If you exhaust that heap, malloc calls in the kernel will fail, and the call will return NULL. This is a condition you can test for. I guess you do not, and then your "do some calculation with allocated memory" dereferences a null pointer and blows up.

To rectify this in the driver API, you will need to call cuCtxSetLimit with the CU_LIMIT_MALLOC_HEAP_SIZE parameter and set that heap size to something more realistic (think the maximum number of resident threads on your device x number of bytes per thread rounded up to the nearest 16 byte page alignment, plus a safety margin). If you do that things will probably start working.

Upvotes: 2

Related Questions