QuaternionsRock

Reputation: 912

What is the difference between mapped memory and managed memory?

I have been reading about the various approaches to memory management offered by CUDA, and I'm struggling to understand the difference between mapped memory:

int *foo;
std::size_t size = 32;
cudaHostAlloc(&foo, size, cudaHostAllocMapped);

...and managed memory:

int *foo;
std::size_t size = 32;
cudaMallocManaged(&foo, size);

They both appear to implicitly transfer memory between the host and device. cudaMallocManaged seems to be the newer API, and it uses the so-called "Unified Memory" system. That said, cudaHostAlloc seems to share many of these properties on 64-bit systems thanks to the unified virtual address space.

The documentation hints at a few other differences, but I am not confident that the absence of explicit feature documentation is enough to reach a correct understanding of how these two functions differ (e.g. I don't believe it is explicitly stated whether the host memory backing cudaMallocManaged is page-locked, though I suspect it is).

They also correspond to different functions in the driver API (cuMemHostAlloc and cuMemAllocManaged), which I think is a good indicator that their behavior differs in some meaningful way.
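
To make the comparison concrete, here is a minimal sketch of how I understand the two pointers are meant to be used (the kernel is purely for illustration):

#include <cuda_runtime.h>
#include <cstddef>

__global__ void touch(int* p){ p[threadIdx.x] += 1; }

int main(){
    const std::size_t size = 32 * sizeof(int);

    // mapped (zero-copy) memory: stays in host RAM, the GPU accesses it over PCIe
    int* mapped;
    cudaHostAlloc(&mapped, size, cudaHostAllocMapped);
    // with UVA the host pointer is usable on the device directly; on older
    // systems cudaHostGetDevicePointer would be needed to obtain a device
    // pointer (and cudaSetDeviceFlags(cudaDeviceMapHost) had to be set first)
    touch<<<1, 32>>>(mapped);
    cudaDeviceSynchronize();

    // managed memory: pages migrate on demand between host and device
    int* managed;
    cudaMallocManaged(&managed, size);
    touch<<<1, 32>>>(managed);
    cudaDeviceSynchronize();

    cudaFreeHost(mapped);
    cudaFree(managed);
}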

Upvotes: 5

Views: 2913

Answers (1)

Abator Abetor

Reputation: 2588

I think the main difference is the paging/page-fault mechanism.

Mapped pinned memory behaves the same as ordinary device memory with respect to access granularity: if one byte of pinned memory is requested, one byte is transparently transferred to the GPU over the PCIe bus. (Maybe the driver merges requests to contiguous memory locations, I do not know.)

Managed memory, on the other hand, has an access granularity of whole memory pages. If the page containing the requested byte is not present on the device, not only the single byte but the whole page (4096 bytes on many systems) is migrated to the GPU from its current location, which can be host memory or the device memory of another GPU.
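
(As an aside, this page migration can also be steered explicitly with runtime hints. A minimal sketch, assuming GPU 0 is the target device:)

#include <cuda_runtime.h>

int main(){
    char* manageddata;
    const size_t bytes = 256ull * 1024 * 1024; // 256 MB, as in the benchmark below

    cudaMallocManaged(&manageddata, bytes);

    // hint that the range is mostly read, so read-only copies may be kept
    // on host and device simultaneously instead of bouncing pages around
    cudaMemAdvise(manageddata, bytes, cudaMemAdviseSetReadMostly, 0);

    // or migrate the whole range up front instead of paying per-page faults
    cudaMemPrefetchAsync(manageddata, bytes, 0 /* destination device */);
    cudaDeviceSynchronize();

    cudaFree(manageddata);
}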

The following program tries to show the different behaviours. 256 MB are allocated, which is equivalent to 64 * 1024 pages of 4096 bytes each. Then a kernel is launched in which each thread accesses the first byte of one page, i.e. every 4096th byte. The time is measured for pinned memory, managed memory, and ordinary device memory.

#include <cuda_runtime.h>
#include <iostream>
#include <cassert>

// each thread touches the first byte of one page
__global__
void kernel(char* __restrict__ data, int pagesize, int numpages){
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if(tid < numpages){
        data[tid * pagesize] += 1;
    }
}

int main(){
    const int pagesize = 4096;
    const int numpages = 1024 * 64;
    const int bytes = pagesize * numpages; // 256 MB total
    cudaError_t status = cudaSuccess;
    float elapsed = 0.0f;
    const int iterations = 5;

    char* devicedata; 
    status = cudaMalloc(&devicedata, bytes);
    assert(status == cudaSuccess);

    char* pinneddata; 
    status = cudaMallocHost(&pinneddata, bytes);
    assert(status == cudaSuccess);

    char* manageddata;
    status = cudaMallocManaged(&manageddata, bytes);
    assert(status == cudaSuccess);

    // make sure the benchmark starts with all managed pages resident in host memory
    status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
    //status = cudaMemPrefetchAsync(manageddata, bytes, 0);
    assert(status == cudaSuccess);

    cudaEvent_t event1, event2;
    cudaEventCreate(&event1);
    cudaEventCreate(&event2);

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(pinneddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);
        
        // one byte is touched per page, so numpages bytes were accessed in `elapsed` ms
        float bandwidth = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "pinned: " << elapsed << " ms, throughput " << bandwidth << " MB/s" << "\n";
    }

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(manageddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);

        float bandwidth = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "managed: " << elapsed << " ms, throughput " << bandwidth << " MB/s" << "\n";

        // migrate all pages back to the host so the next iteration has to fault them in again
        status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
        assert(status == cudaSuccess);
    }

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(devicedata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);
        
        float bandwidth = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "device: " << elapsed << " ms, throughput " << bandwidth << " MB/s" << "\n";
    }

    cudaFreeHost(pinneddata);
    cudaFree(manageddata);
    cudaFree(devicedata);
    cudaEventDestroy(event1);
    cudaEventDestroy(event2);

}
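
The program should compile with plain nvcc (e.g. nvcc -O2 paging.cu, the file name chosen arbitrarily). Note that cudaMemPrefetchAsync and on-demand page migration require a Pascal or newer GPU and CUDA 8 or later, as far as I know.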

When the managed memory is prefetched to the host, the following times are observed:

pinned: 1.4577 ms, throughput 42.8759 MB/s
pinned: 1.4927 ms, throughput 41.8703 MB/s
pinned: 1.44947 ms, throughput 43.1192 MB/s
pinned: 1.44371 ms, throughput 43.2912 MB/s
pinned: 1.4496 ms, throughput 43.1153 MB/s
managed: 40.3646 ms, throughput 1.54839 MB/s
managed: 35.8052 ms, throughput 1.74555 MB/s
managed: 36.7788 ms, throughput 1.69935 MB/s
managed: 37.3166 ms, throughput 1.67486 MB/s
managed: 35.3378 ms, throughput 1.76864 MB/s
device: 0.052256 ms, throughput 1196.03 MB/s
device: 0.061312 ms, throughput 1019.38 MB/s
device: 0.060736 ms, throughput 1029.04 MB/s
device: 0.060096 ms, throughput 1040 MB/s
device: 0.060352 ms, throughput 1035.59 MB/s

nvprof confirms that in the case of managed memory, all 256 MB are transferred to the device.

==27443== Unified Memory profiling result:
Device "TITAN Xp (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    6734  38.928KB  4.0000KB  776.00KB  256.0000MB  29.95677ms  Host To Device

When we remove the prefetching within the loop, the migrated pages remain on the GPU, which improves the access time to the level of normal device memory:

pinned: 1.46848 ms, throughput 42.561 MB/s
pinned: 1.50842 ms, throughput 41.4342 MB/s
pinned: 1.44285 ms, throughput 43.3171 MB/s
pinned: 1.45802 ms, throughput 42.8665 MB/s
pinned: 1.4431 ms, throughput 43.3094 MB/s
managed: 41.9972 ms, throughput 1.4882 MB/s  <--- need to migrate pages
managed: 0.047584 ms, throughput 1313.47 MB/s <--- pages already present on GPU
managed: 0.059552 ms, throughput 1049.5 MB/s
managed: 0.057248 ms, throughput 1091.74 MB/s
managed: 0.062336 ms, throughput 1002.63 MB/s
device: 0.06176 ms, throughput 1011.98 MB/s
device: 0.062592 ms, throughput 998.53 MB/s
device: 0.062176 ms, throughput 1005.21 MB/s
device: 0.06128 ms, throughput 1019.91 MB/s
device: 0.063008 ms, throughput 991.937 MB/s
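
The first managed iteration still pays the migration cost. Prefetching to the device before the timed loop instead (the commented-out line in the program above) should let even the first iteration run at device speed. A sketch of that variant, reusing the program's variables and assuming the device in use is GPU 0:

    // migrate all managed pages to GPU 0 before the timed loop, mirroring
    // the commented-out prefetch in the program above
    status = cudaMemPrefetchAsync(manageddata, bytes, 0);
    assert(status == cudaSuccess);
    status = cudaDeviceSynchronize(); // make sure the migration finished before timing
    assert(status == cudaSuccess);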

Upvotes: 6
