talonmies

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.

I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like

__device__ char* buffer;

but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:

  1. What is the lifetime of the various methods of allocating global memory?
  2. How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?

[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]

Upvotes: 0

Views: 1501

Answers (1)

talonmies

What is the lifetime of the various methods of allocating global memory?

All global memory allocations have the lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.

How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?

Either method will work as you require, although the host side APIs are much simpler to use. There are, however, some important differences between the two approaches.

Memory allocated using malloc or new in device code comes from the device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. The device heap is also not accessible to host side memory management APIs, so you will additionally need a copy kernel to transfer the contents to host-API accessible memory before you can copy them back to the host.
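
For comparison, here is a minimal sketch of the device side route. It is untested, and the kernel names, heap size and launch configuration are illustrative only, not part of the original example:

#include <vector>

__device__ char* buffer;

__global__ void setup(size_t sz)
{
    // A single thread allocates on the device runtime heap and stores
    // the address in the global scope pointer (a real version should
    // check the return value for NULL)
    buffer = (char*)malloc(sz);
}

__global__ void copy_out(char* dst, size_t sz)
{
    // Stage the runtime heap contents into host-API accessible memory
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sz)
        dst[i] = buffer[i];
}

int main()
{
    const size_t sz = 800 * 600;

    // The runtime heap must be sized *before* any kernel calls malloc
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 2 * sz);
    setup<<<1, 1>>>(sz);

    // Kernels using buffer go here

    // The runtime heap is invisible to cudaMemcpy, so copy through a
    // cudaMalloc'd staging buffer using a kernel
    char* d_staging;
    cudaMalloc(&d_staging, sz);
    copy_out<<<(sz + 255) / 256, 256>>>(d_staging, sz);

    std::vector<char> results(sz);
    cudaMemcpy(&results[0], d_staging, sz, cudaMemcpyDeviceToHost);

    // The runtime heap allocation itself persists for the context
    // lifetime unless a kernel calls free() on it
    cudaFree(d_staging);

    return 0;
}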

The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:

#include <vector>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has a lifespan until it is freed here
    cudaFree(d_buffer);

    return 0;
}

[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
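
For completeness, a trivial kernel illustrating the point of the exercise, i.e. that buffer can be dereferenced directly inside any kernel without being passed as an argument, might look like this (the name and launch configuration are hypothetical):

__global__ void fill(char value, size_t sz)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sz)
        buffer[i] = value;  // global __device__ pointer used directly, no kernel argument
}

which could be launched at the "// Kernels go here using buffer" point as, for example, fill<<<(800*600 + 255)/256, 256>>>(1, 800*600);.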

So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.

Upvotes: 2
