user2712376

Reputation: 25

Is device memory allocated using cudaMalloc inaccessible to free() on the device?

I cannot deallocate memory on the host that I've allocated on the device, or deallocate memory on the device that I allocated on the host. I'm using CUDA 5.5 with VS2012 and Nsight. Is it because the host heap isn't shared with the device heap (or the other way around), so dynamic allocations made on one side are unknown to the other?

If this is in the documentation, it is not easy to find. It's also important to note that an error wasn't thrown until I ran the program with CUDA debugging and with the Memory Checker enabled. The problem did not cause a crash outside of CUDA debugging, but it would've caused problems later if I hadn't checked for memory issues retroactively. If there's a handy way to copy the heap/stack from host to device, that'd be fantastic... hopes and dreams.

Here's an example for my question:

__global__ void kernel(char *ptr)
{
  free(ptr);   // trying to release a cudaMalloc'ed pointer in device code
}

int main(void)
{
  char *ptr;
  cudaMalloc((void **)&ptr, sizeof(char *));  // cudaMalloc takes only a pointer and a size
  kernel<<<1, 1>>>(ptr);
  cudaDeviceSynchronize();
  return 0;
}
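For reference, here's the kind of status checking that can be wrapped around the launch (just a sketch; in my case nothing was reported this way until I ran with CUDA debugging and the Memory Checker enabled):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(char *ptr)
{
  free(ptr);   // the problematic call
}

int main(void)
{
  char *ptr = NULL;
  cudaError_t err = cudaMalloc((void **)&ptr, sizeof(char *));
  if (err != cudaSuccess) printf("cudaMalloc: %s\n", cudaGetErrorString(err));

  kernel<<<1, 1>>>(ptr);
  err = cudaPeekAtLastError();        // launch errors
  if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));
  err = cudaDeviceSynchronize();      // errors raised while the kernel ran
  if (err != cudaSuccess) printf("sync: %s\n", cudaGetErrorString(err));

  cudaFree(ptr);                      // the correct way to release this allocation
  return 0;
}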

Upvotes: 2

Views: 2184

Answers (2)

user2712376

Reputation: 25

Here is my solution for mixing dynamic memory allocation on the host via the CRT, via the host CUDA API, and via the kernel's memory functions. First off, as mentioned above, the three allocators must be managed separately, using a strategy that never requires a dynamic allocation to be handed directly between system and device without prior communication and coordination. Explicit copies are required, and those copies cannot operate on the kernel's device heap from the host, as noted in Robert's answer and its comments.

I also suggest keeping track of (auditing) the number of bytes allocated and deallocated through the three memory-management APIs. For instance, every time system:malloc, host:cudaMalloc, device:malloc or one of the matching frees is called, update a per-heap byte counter, i.e. one each for system, host, and device. This helps with tracking down leaks when debugging.
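For instance, a minimal sketch of that bookkeeping (every name below, such as auditedMalloc and deviceHeapBytes, is made up for illustration; only the CUDA calls themselves are real API; in-kernel malloc/free needs compute capability 2.0+):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One running byte total per heap (names are illustrative only).
static size_t systemBytes = 0;                       // CRT malloc/free
static size_t hostApiBytes = 0;                      // cudaMalloc/cudaFree
__device__ unsigned long long deviceHeapBytes = 0;   // in-kernel malloc/free

void *auditedMalloc(size_t n)        { systemBytes += n; return malloc(n); }
void  auditedFree(void *p, size_t n) { systemBytes -= n; free(p); }

cudaError_t auditedCudaMalloc(void **p, size_t n)
{
  cudaError_t err = cudaMalloc(p, n);
  if (err == cudaSuccess) hostApiBytes += n;
  return err;
}

cudaError_t auditedCudaFree(void *p, size_t n)
{
  cudaError_t err = cudaFree(p);
  if (err == cudaSuccess) hostApiBytes -= n;
  return err;
}

__device__ void *auditedDeviceMalloc(size_t n)
{
  atomicAdd(&deviceHeapBytes, (unsigned long long)n);
  return malloc(n);
}

__device__ void auditedDeviceFree(void *p, size_t n)
{
  atomicAdd(&deviceHeapBytes, (unsigned long long)(-(long long)n));  // unsigned wrap-around subtracts
  free(p);
}

__global__ void demo(void)
{
  char *d = (char *)auditedDeviceMalloc(64);
  auditedDeviceFree(d, 64);
}

int main(void)
{
  char *h = (char *)auditedMalloc(64);
  char *d = NULL;
  auditedCudaMalloc((void **)&d, 64);

  demo<<<1, 1>>>();
  cudaDeviceSynchronize();

  unsigned long long devBytes = 0;
  cudaMemcpyFromSymbol(&devBytes, deviceHeapBytes, sizeof(devBytes));

  printf("system: %llu  host API: %llu  device heap: %llu\n",
         (unsigned long long)systemBytes, (unsigned long long)hostApiBytes, devBytes);

  auditedCudaFree(d, 64);
  auditedFree(h, 64);
  return 0;
}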

Dynamically allocating, managing, and auditing memory across the system, host, and device for deep copies of dynamic structures is an involved process. Here is a strategy that works, with a code sketch of the whole sequence after the note below; suggestions are welcome:

  1. On the system heap, allocate (with malloc or cudaMallocHost) a structure type that contains the pointers;

  2. Allocate device memory from host for the struct, and copy the structure to the device (i.e. cudaMalloc, cudaMemcpy, etc.);

  3. From within a kernel, use malloc to create an allocation on the device heap and save the pointer(s) in the structure that was copied to the device in step 2;

  4. Communicate what the kernel allocated back to the system by exchanging the size of the allocation behind each pointer in the struct;

  5. The host performs a matching allocation on the device through the CUDA API (i.e. cudaMalloc); I recommend keeping a separate pointer field in the structure for it;

  6. At this point, the memory the kernel allocated on the device heap can be copied, from device code, into the location the host allocated in device memory (the host cannot do this copy itself; cudaMemcpy does not work on device-heap pointers);

  7. Kernel cleans up memory allocations; and,

  8. The host uses cudaMemcpy to move the structure back from the device; a strategy like the one outlined in the comments under Robert's answer can be used as necessary for deep copies.

Note: cudaMallocHost allocations and system malloc allocations are both plain host (system) memory as far as host code is concerned, making the system heap and the pinned host heap interoperable, as the CUDA programming guide mentions. That is why only the system heap and the device heap are distinguished here.
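Putting the steps together, here is a minimal sketch of the whole sequence; the Node struct, the kernel names and the 64-byte size are all made up for illustration:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Illustrative struct: one device-heap pointer plus a second pointer for a
// host-managed (cudaMalloc) copy of the same data.
struct Node {
  char  *devHeapData;   // filled in by the kernel via malloc (step 3)
  char  *apiData;       // allocated from the host via cudaMalloc (step 5)
  size_t nBytes;        // size reported back to the host (step 4)
};

// Step 3: the kernel allocates from the device heap and records the size.
__global__ void buildOnDevice(Node *n)
{
  n->nBytes = 64;
  n->devHeapData = (char *)malloc(n->nBytes);
  for (size_t i = 0; i < n->nBytes; ++i)
    n->devHeapData[i] = (char)i;
}

// Step 6: device code copies from the device-heap storage into the
// cudaMalloc'ed buffer; the host cannot cudaMemcpy a device-heap pointer.
__global__ void copyToApiBuffer(Node *n)
{
  for (size_t i = 0; i < n->nBytes; ++i)
    n->apiData[i] = n->devHeapData[i];
}

// Step 7: the kernel releases its own device-heap allocation.
__global__ void cleanupOnDevice(Node *n)
{
  free(n->devHeapData);
  n->devHeapData = NULL;
}

int main(void)
{
  // Step 1: system-side copy of the struct.
  Node *hNode = (Node *)malloc(sizeof(Node));
  memset(hNode, 0, sizeof(Node));

  // Step 2: device copy of the struct.
  Node *dNode = NULL;
  cudaMalloc((void **)&dNode, sizeof(Node));
  cudaMemcpy(dNode, hNode, sizeof(Node), cudaMemcpyHostToDevice);

  // Step 3.
  buildOnDevice<<<1, 1>>>(dNode);
  cudaDeviceSynchronize();

  // Step 4: pull the struct back so the host learns the size. The devHeapData
  // value that comes with it must not be passed to cudaMemcpy or cudaFree.
  cudaMemcpy(hNode, dNode, sizeof(Node), cudaMemcpyDeviceToHost);

  // Step 5: the host makes a matching cudaMalloc allocation and publishes
  // the pointer into the device copy of the struct.
  char *dApiData = NULL;
  cudaMalloc((void **)&dApiData, hNode->nBytes);
  hNode->apiData = dApiData;
  cudaMemcpy(&dNode->apiData, &dApiData, sizeof(char *), cudaMemcpyHostToDevice);

  // Step 6.
  copyToApiBuffer<<<1, 1>>>(dNode);

  // Step 7.
  cleanupOnDevice<<<1, 1>>>(dNode);
  cudaDeviceSynchronize();

  // Step 8: the data now lives in memory the runtime knows about.
  char *hData = (char *)malloc(hNode->nBytes);
  cudaMemcpy(hData, dApiData, hNode->nBytes, cudaMemcpyDeviceToHost);
  printf("hData[10] = %d\n", hData[10]);

  free(hData);
  free(hNode);
  cudaFree(dApiData);
  cudaFree(dNode);
  return 0;
}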

Upvotes: 0

Robert Crovella

Reputation: 151944

No, you can't do this.

This topic is specifically covered in the programming guide here

Memory allocated via malloc() cannot be freed using the runtime (i.e., by calling any of the free memory functions from Device Memory). Similarly, memory allocated via the runtime (i.e., by calling any of the memory allocation functions from Device Memory) cannot be freed via free().

It's in section B.18.2 of the programming guide, within section B.18, "Dynamic Global Memory Allocation and Operations".

The basic reason for this is that the allocation mechanism used by the runtime (e.g. cudaMalloc, cudaFree) is separate from the device-code allocator, and in fact the two reserve out of logically separate regions of global memory.
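To illustrate, each allocator pairs only with its own deallocator; a minimal sketch (not taken from the programming guide):

#include <cuda_runtime.h>

__global__ void k(void)
{
  char *p = (char *)malloc(16);  // device-heap allocation made by device code
  free(p);                       // must be released with in-kernel free()
}

int main(void)
{
  char *d = NULL;
  cudaMalloc((void **)&d, 16);   // runtime allocation made by the host
  k<<<1, 1>>>();
  cudaDeviceSynchronize();
  cudaFree(d);                   // must be released with cudaFree(), never in-kernel free()
  return 0;
}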

You may want to read the entire B.18 section of the programming guide, which covers these topics on device dynamic memory allocation.

Upvotes: 4
