Memory allocation on a CUDA device is not what is expected

I can't create new tags, but this should be under a MANAGEDCUDA tag, since I'm using that framework to use CUDA from C#.

I allocate two int arrays with this code for testing:

// sum is the number of cells; each int takes 4 bytes.
Console.WriteLine("Cells: " + sum + " Expected Total Memory (x4): " + sum * 4);
// Free device memory before allocating.
int temp = cntxt.GetFreeDeviceMemorySize();
Console.Write("\n Memory available before:" + cntxt.GetFreeDeviceMemorySize() + "\n");
CudaDeviceVariable<int> matrix = new CudaDeviceVariable<int>(sum);
CudaDeviceVariable<int> matrixDir = new CudaDeviceVariable<int>(sum);
Console.Write("\n Memory available after allocation:" + cntxt.GetFreeDeviceMemorySize() + "\n");
Console.WriteLine("Memory took: " + (temp - cntxt.GetFreeDeviceMemorySize()));
// Two arrays of sum ints should take sum * 8 bytes in total.
Console.WriteLine("Difference between the expected and allocated: " + ((temp - cntxt.GetFreeDeviceMemorySize()) - sum * 8));

After running it, I got this in the console:

[Image: Console Run]

Upvotes: 1

Views: 367

Answers (1)

user703016

Reputation: 37945

When you allocate memory through an allocator (malloc, cudaMalloc, ...), it needs to keep track of the bytes you allocated, in special metadata structures. This metadata may contain, for example, the number of bytes allocated and their location in memory, some padding to align the allocation, and buffer-overrun checks.

To reduce the management overhead, most modern allocators use pages, that is, they allocate memory in indivisible chunks of a fixed size. On many host systems, this size is by default 4 kB.

In your precise case, it would appear that CUDA is serving your memory allocation requests in pages of 64 kB. That is, if you request 56 kB, CUDA will serve you 64 kB anyway, and the unused 8 kB are "wasted" (from the point of view of your application).

When you request an allocation of 1552516 bytes (that's about 23.7 pages of 65536 bytes), the runtime will instead serve you 24 pages (1572864 bytes): that's 20348 bytes extra per array. Double that (because you have 2 arrays), and this is where your 40696 bytes difference comes from.
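The arithmetic above can be checked with a few lines of plain C#. This is only a sketch: the 64 kB page size is the value inferred from your numbers, not a documented constant, and `PageMath` and `RoundUpToPages` are names made up for the illustration.

```csharp
using System;

static class PageMath
{
    // Rounds a request up to a whole number of pages (ceiling division).
    public static long RoundUpToPages(long bytes, long pageSize)
    {
        long pages = (bytes + pageSize - 1) / pageSize;
        return pages * pageSize;
    }

    static void Main()
    {
        const long pageSize = 64 * 1024;  // assumption: inferred from the observed numbers
        const long requested = 1552516;   // bytes requested per array

        long served = RoundUpToPages(requested, pageSize);
        long extraPerArray = served - requested;

        Console.WriteLine("Served per array:   " + served);            // 1572864
        Console.WriteLine("Extra per array:    " + extraPerArray);     // 20348
        Console.WriteLine("Extra for 2 arrays: " + 2 * extraPerArray); // 40696
    }
}
```

If the page size on your device were different, the same function would give a different overhead, which is one way to test a guess against the numbers you measure.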

Note: The page size varies between GPUs and driver versions. You may try to find it out experimentally by yourself, or search for results published by other people. In any case, this is (to the best of my knowledge) not documented, and should therefore not be relied upon if you intend your program to be portable.
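One way to run such an experiment is to allocate buffers of growing size and record how much the free memory drops each time. The sketch below is hedged accordingly: `PageProbe` and `EstimateGranularity` are hypothetical names, and the two delegates stand in for whatever your framework offers to query free memory and to allocate a buffer, so the code itself does not depend on any CUDA library.

```csharp
using System;

static class PageProbe
{
    // Allocates buffers of doubling size and returns the smallest non-zero
    // drop in free memory that was observed. If the allocator works in
    // fixed-size pages, that drop is a plausible estimate of the page size.
    // Returns 0 if no drop was ever observed.
    public static long EstimateGranularity(
        Func<long> freeBytes,              // reports the free device memory in bytes
        Func<long, IDisposable> allocate,  // allocates n bytes, releases them on Dispose
        long maxRequest = 1 << 20)
    {
        long smallestDrop = long.MaxValue;
        for (long request = 1; request <= maxRequest; request *= 2)
        {
            long before = freeBytes();
            using (allocate(request))
            {
                long drop = before - freeBytes();
                if (drop > 0 && drop < smallestDrop)
                {
                    smallestDrop = drop;
                }
            }
        }
        return smallestDrop == long.MaxValue ? 0 : smallestDrop;
    }
}
```

With managedCuda, the delegates could presumably be built from the calls already used in the question, `cntxt.GetFreeDeviceMemorySize()` and `new CudaDeviceVariable<byte>(n)`, but that wiring is an assumption and untested here. A real driver may also reserve a larger block on the first allocation and carve later small requests out of it, so treat the result as a hint rather than a measurement.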

Upvotes: 6
