Dynamic allocation on the device makes the memory copy fail

Question

I am using the CUDA driver API. A simplified description of the problem follows:

// .cu file, compiled to a PTX file.

extern "C" __global__ void SomeFunction(char* d_buffer) {
    // Allocate memory per thread (about 5x10^5 threads).
    float* p = (float*)malloc(sizeof(float) * 100);
    // ... do some calculation with the allocated memory ...
    // ... do some other calculation with d_buffer ...
    free(p);
}

// .cpp file

int main()
{
    // Allocate device buffer
    CUdeviceptr d_buffer;
    cuMemAlloc(&d_buffer, bytes);
    // Allocate host buffer (an array of `bytes` chars)
    char* h_buffer = new char[bytes];
    // Copy host buffer to device buffer (destination first, then source)
    cuMemcpyHtoD(d_buffer, h_buffer, bytes);

    CUfunction func;
    cuModuleGetFunction(&func, module, "SomeFunction");
    cuLaunchKernel(func, grid_dims,...,block_dims,...,args,...);
    // Copy device buffer to host buffer (destination first, then source)
    cuMemcpyDtoH(h_buffer, d_buffer, bytes); // Failed!
}

The problem is that the copy operation on the last line of the .cpp file fails. However, if I comment out the dynamic allocation (malloc, free) in the .cu file, the copy succeeds. Are there any restrictions on using dynamic allocation with the driver API? If so, what are they, and how can I use dynamic allocation correctly with the driver API?
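
A likely cause, stated as an assumption since the post shows no error codes: in-kernel malloc draws from a fixed-size device heap, which the CUDA programming guide documents as 8 MB by default. If the heap runs out, malloc returns NULL inside the kernel, a write through that pointer faults, and because the launch is asynchronous the error can surface at the next synchronizing call, here cuMemcpyDtoH. The C++ sketch below only does the host-side arithmetic for the numbers in the question; the helper names (required_heap_bytes, fits_default_heap) are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstddef>
#include <cstdint>

// Default device malloc heap size as documented in the CUDA programming guide (8 MB).
constexpr std::uint64_t kDefaultMallocHeapBytes = 8ull * 1024 * 1024;

// Hypothetical helper: rough heap demand if every thread held its allocation
// at the same time. Real demand depends on how many threads are resident at
// once and on allocator overhead, so treat this as an estimate only.
constexpr std::uint64_t required_heap_bytes(std::uint64_t threads,
                                            std::uint64_t bytes_per_thread) {
    return threads * bytes_per_thread;
}

// Hypothetical helper: does the estimate fit in the default heap?
constexpr bool fits_default_heap(std::uint64_t needed_bytes) {
    return needed_bytes <= kDefaultMallocHeapBytes;
}

// Numbers taken from the question: about 5x10^5 threads, 100 floats each.
constexpr std::uint64_t kThreads = 500000;
constexpr std::uint64_t kBytesPerThread = sizeof(float) * 100;
constexpr std::uint64_t kNeededBytes = required_heap_bytes(kThreads, kBytesPerThread);

// The estimate is far larger than the default heap, so some device-side
// malloc calls can be expected to return NULL.
static_assert(!fits_default_heap(kNeededBytes),
              "estimate exceeds the default device malloc heap");

// Driver API call that would raise the limit (not compiled here, needs cuda.h):
//   cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, heap_bytes);
```

With the driver API the heap can be enlarged with cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, heap_bytes). The programming guide says the heap size has to be set before any program that uses malloc is loaded into the context, so the call belongs before the module load rather than just before the launch. Checking p for NULL in the kernel and checking every CUresult on the host would confirm or rule out this explanation.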

