skm

Reputation: 5659

DeviceToHost and HostToDevice times in case of CUDA Unified Memory

I am trying to compare the total execution times of the managed and unmanaged versions of CUDA memory management.

In the following sample code, I have two functions that do exactly the same thing. The only difference is in their memory management: one function uses cudaMalloc()/cudaMemcpy(), and the other uses only cudaMallocManaged().

I used nvprof to measure the different times and obtained the following outputs:

Managed Version nvprof Output:

== 29028 == Profiling result :
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities : 100.00 % 59.425us         1  59.425us  59.425us  59.425us  add(int, float*, float*)
API calls : 78.08 % 296.49ms         2  148.24ms  1.7127ms  294.78ms  cudaMallocManaged
19.61 % 74.451ms         1  74.451ms  74.451ms  74.451ms  cuDevicePrimaryCtxRelease
1.55 % 5.8705ms         1  5.8705ms  5.8705ms  5.8705ms  cudaLaunchKernel
0.67 % 2.5547ms         2  1.2774ms  974.40us  1.5803ms  cudaFree
0.07 % 280.60us         1  280.60us  280.60us  280.60us  cudaDeviceSynchronize
0.01 % 28.300us         3  9.4330us  3.0000us  13.300us  cuModuleUnload
0.01 % 26.800us         1  26.800us  26.800us  26.800us  cuDeviceTotalMem
0.00 % 17.700us       101     175ns     100ns     900ns  cuDeviceGetAttribute
0.00 % 10.100us         3  3.3660us     300ns  8.8000us  cuDeviceGetCount
0.00 % 3.2000us         1  3.2000us  3.2000us  3.2000us  cuDeviceGetName
0.00 % 3.0000us         2  1.5000us     300ns  2.7000us  cuDeviceGet
0.00 % 500ns         1     500ns     500ns     500ns  cuDeviceGetLuid
0.00 % 200ns         1     200ns     200ns     200ns  cuDeviceGetUuid

== 29028 == Unified Memory profiling result :
Device "GeForce GTX 1070 (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
64  128.00KB  128.00KB  128.00KB  8.000000MB  3.279000ms  Host To Device
146  84.164KB  32.000KB  1.0000MB  12.00000MB  64.50870ms  Device To Host

Un-managed Version nvprof Output:

== 23864 == Profiling result :
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities : 56.30 % 1.5032ms         2  751.60us  751.44us  751.76us  [CUDA memcpy HtoD]
41.48 % 1.1075ms         1  1.1075ms  1.1075ms  1.1075ms  [CUDA memcpy DtoH]
2.23 % 59.457us         1  59.457us  59.457us  59.457us  add(int, float*, float*)
API calls : 78.92 % 270.08ms         2  135.04ms  656.40us  269.43ms  cudaMalloc
19.79 % 67.730ms         1  67.730ms  67.730ms  67.730ms  cuDevicePrimaryCtxRelease
1.05 % 3.5796ms         3  1.1932ms  1.0106ms  1.4341ms  cudaMemcpy
0.10 % 346.20us         2  173.10us  3.4000us  342.80us  cudaFree
0.09 % 314.30us         1  314.30us  314.30us  314.30us  cudaDeviceSynchronize
0.02 % 74.200us         1  74.200us  74.200us  74.200us  cudaLaunchKernel
0.01 % 34.700us         3  11.566us  2.5000us  29.100us  cuModuleUnload
0.01 % 24.100us         1  24.100us  24.100us  24.100us  cuDeviceTotalMem
0.00 % 17.100us       101     169ns     100ns     900ns  cuDeviceGetAttribute
0.00 % 9.0000us         3  3.0000us     300ns  8.0000us  cuDeviceGetCount
0.00 % 3.2000us         1  3.2000us  3.2000us  3.2000us  cuDeviceGetName
0.00 % 1.5000us         2     750ns     200ns  1.3000us  cuDeviceGet
0.00 % 300ns         1     300ns     300ns     300ns  cuDeviceGetUuid
0.00 % 300ns         1     300ns     300ns     300ns  cuDeviceGetLuid

My Code:

#include <iostream>
#include <math.h>
#include <stdlib.h>

// Kernel to add the elements of two arrays (y[i] = x[i] + y[i]), grid-stride loop
__global__ void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int RunManagedVersion()
{
    int N = 1 << 20;
    float* x, * y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));

    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

int RunUnmanagedVersion()
{
    int N = 1 << 20;

    //Allocate host memory for input and output arrays
    float* x = (float*)calloc(N, sizeof(float));
    float* y = (float*)calloc(N, sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    //Allocate device memory for input and output arrays
    float* d_pX = 0;
    float* d_pY = 0;
    cudaMalloc(&d_pX, N * sizeof(float));
    cudaMalloc(&d_pY, N * sizeof(float));

    //Copy INPUT ARRAY data from host to device
    cudaMemcpy(d_pX, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pY, y, N * sizeof(float), cudaMemcpyHostToDevice);


    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, d_pX, d_pY);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    //Copy Results - Device to Host
    cudaMemcpy(y, d_pY, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));

    std::cout << "Max error: " << maxError << std::endl;

    // device memory free
    cudaFree(d_pX);
    cudaFree(d_pY);

    //host memory free
    free(x);
    free(y);

    return 0;
}

int main()
{
    RunUnmanagedVersion();

    //RunManagedVersion();

    return 0;
}

QUESTION: I have run the above code several times and noted that the DeviceToHost data transfer time is significantly higher for the managed version (i.e. Unified Memory). Is this normal (and why?), or am I doing something wrong in the code?

Upvotes: 0

Views: 280

Answers (1)

Robert Crovella

Reputation: 151934

I believe the things I'm going to say here are:

  1. undocumented, therefore subject to change, but
  2. observable with careful profiling

Furthermore, all of these comments are specific to the UM regime associated with either Windows UM usage, or Linux usage with a pre-Pascal GPU.

In this pre-Pascal and/or Windows UM regime, transfer of data from host to device is initiated at the point of kernel launch. This manifests itself as latency in the kernel launch process (from the time the kernel launch is requested until the time the kernel code actually begins executing).
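
As a rough illustration (a sketch against the posted RunManagedVersion(), not code from the question), you can observe this launch latency by timing the launch request and the subsequent synchronization separately with a host-side clock:

    // Sketch: separate the cost of the launch request from launch-to-completion.
    // Assumes the managed allocations x, y and the launch configuration from the
    // posted code; requires #include <chrono> and <cstdio>.
    auto t0 = std::chrono::high_resolution_clock::now();
    add<<<numBlocks, blockSize>>>(N, x, y);   // in this UM regime, H->D migration begins here
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();                  // waits for the migration and the kernel to finish
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("launch call: %.3f ms, launch-to-completion: %.3f ms\n",
        std::chrono::duration<double, std::milli>(t1 - t0).count(),
        std::chrono::duration<double, std::milli>(t2 - t1).count());

Comparing these two numbers between the managed and unmanaged versions should show the extra latency absorbed by the launch/synchronize path in the UM case.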

In this scenario, the UM system transfers the data in fixed size chunks. This is evident from your profiler output:

Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
64  128.00KB  128.00KB  128.00KB  8.000000MB  3.279000ms  Host To Device

Since there is indeed 8MB of data that needs to be transferred to the device, and in the non-UM case that transfer seems to happen in ~1.5ms:

GPU activities : 56.30 % 1.5032ms         2  751.60us  751.44us  751.76us  [CUDA memcpy HtoD]

we would conclude that even the H->D case under UM on Windows is somewhat less performant than the non-UM case (roughly 2.6GB/s effective versus ~5.6GB/s). I attribute this to the need (for whatever reason) to transfer the 8MB of data in relatively small 128KB chunks. In addition, WDDM has direct control over this GPU in the Windows case, and CUDA is actually a "client" of WDDM for these activities, especially those related to memory. It's entirely possible that WDDM decided it wanted to do something with/to the GPU during the data transfer, and may have inserted some gaps or inefficiency.

In the D->H case on Windows, the situation seems to be different, and arguably worse. However, we have to be careful in assessing what is going on here. The first question might be:

Why is 12MB of data being transferred D->H?

There appear to be a few things to note:

  • A UM allocation seems to appear first in device memory. (This is different from the demand-paged case!) This means that if the first thing you do is access that UM allocation in host code, then the allocation has to be transferred from device to host. This accounts for 8MB of the 12MB of transfer (the host-side initialization of x and y), and if you do careful profiling experiments, you can convince yourself of this.
  • A UM allocation on the device seems to transfer to the host based on host code activity. This is self-evident if the first bullet above is considered carefully. But even if we only pay attention to the data transfer after the kernel activity, it's easy, via profiler experimentation, to convince ourselves that if no host code after the kernel launch actually accesses the data, no transfers will occur there (see the sketch after this list).
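
For illustration, here is a rough sketch of such an experiment against the posted RunManagedVersion() (the VARIANT switch is just a placeholder for building two separate test cases; it is not code from the question):

    // Sketch of a profiling experiment using the posted managed version.
    // Build/run each variant under nvprof and compare the "Device To Host" rows.
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

#if VARIANT == 1
    // Variant 1: the host reads only y (4MB) after the kernel, as in the posted code.
    // Expectation: ~4MB of additional D->H traffic on top of the ~8MB caused by the
    // host-side initialization of x and y, for the 12MB total seen in the question.
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;
#else
    // Variant 0: no host access to x or y after the kernel.
    // Expectation: the post-kernel D->H transfers disappear; only the ~8MB from the
    // initial host writes to x and y remain.
#endif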

The second bullet above means that we could presume that D->H transfers might be:

  1. "smeared out" over the duration of the host code that is actually somehow causing these transfers
  2. somehow happening "concurrently" with the host code.

We could also conclude that only about 1/3 of the reported UM D->H activity (the ~4MB read back by the error-check loop, out of 12MB total) actually occurs after the kernel call, and so we might choose to compare only that portion to the D->H report from the non-UM case.

The net of all this is that I don't think it's a trivial matter to compare the two cases simply by looking at the kind of data I have excerpted above. Yes, the UM case probably does perform worse than the non-UM case. Nothing in the CUDA documentation states that the two are expected to be performance-identical. No, you are not doing anything "wrong".

FWIW, the Maxwell/Kepler UM case on Linux looks much better than it does on Windows WDDM, so I think WDDM is probably also involved in the less efficient behavior.

Upvotes: 1
