skm

Reputation: 5659

DeviceToHost and HostToDevice times in case of CUDA Unified Memory

I am trying to compare the total execution times of the managed and unmanaged versions of CUDA memory management.

In the following sample code, I have two functions that do exactly the same thing. The only difference is in their memory management: one function uses cudaMalloc()/cudaMemcpy(), and the other uses only cudaMallocManaged().

I used nvprof to measure the different times and obtained the following outputs:

Managed Version nvprof Output:

== 29028 == Profiling result :
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities : 100.00 % 59.425us         1  59.425us  59.425us  59.425us  add(int, float*, float*)
API calls : 78.08 % 296.49ms         2  148.24ms  1.7127ms  294.78ms  cudaMallocManaged
19.61 % 74.451ms         1  74.451ms  74.451ms  74.451ms  cuDevicePrimaryCtxRelease
1.55 % 5.8705ms         1  5.8705ms  5.8705ms  5.8705ms  cudaLaunchKernel
0.67 % 2.5547ms         2  1.2774ms  974.40us  1.5803ms  cudaFree
0.07 % 280.60us         1  280.60us  280.60us  280.60us  cudaDeviceSynchronize
0.01 % 28.300us         3  9.4330us  3.0000us  13.300us  cuModuleUnload
0.01 % 26.800us         1  26.800us  26.800us  26.800us  cuDeviceTotalMem
0.00 % 17.700us       101     175ns     100ns     900ns  cuDeviceGetAttribute
0.00 % 10.100us         3  3.3660us     300ns  8.8000us  cuDeviceGetCount
0.00 % 3.2000us         1  3.2000us  3.2000us  3.2000us  cuDeviceGetName
0.00 % 3.0000us         2  1.5000us     300ns  2.7000us  cuDeviceGet
0.00 % 500ns         1     500ns     500ns     500ns  cuDeviceGetLuid
0.00 % 200ns         1     200ns     200ns     200ns  cuDeviceGetUuid

== 29028 == Unified Memory profiling result :
Device "GeForce GTX 1070 (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
64  128.00KB  128.00KB  128.00KB  8.000000MB  3.279000ms  Host To Device
146  84.164KB  32.000KB  1.0000MB  12.00000MB  64.50870ms  Device To Host

Un-managed Version nvprof Output:

== 23864 == Profiling result :
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities : 56.30 % 1.5032ms         2  751.60us  751.44us  751.76us  [CUDA memcpy HtoD]
41.48 % 1.1075ms         1  1.1075ms  1.1075ms  1.1075ms  [CUDA memcpy DtoH]
2.23 % 59.457us         1  59.457us  59.457us  59.457us  add(int, float*, float*)
API calls : 78.92 % 270.08ms         2  135.04ms  656.40us  269.43ms  cudaMalloc
19.79 % 67.730ms         1  67.730ms  67.730ms  67.730ms  cuDevicePrimaryCtxRelease
1.05 % 3.5796ms         3  1.1932ms  1.0106ms  1.4341ms  cudaMemcpy
0.10 % 346.20us         2  173.10us  3.4000us  342.80us  cudaFree
0.09 % 314.30us         1  314.30us  314.30us  314.30us  cudaDeviceSynchronize
0.02 % 74.200us         1  74.200us  74.200us  74.200us  cudaLaunchKernel
0.01 % 34.700us         3  11.566us  2.5000us  29.100us  cuModuleUnload
0.01 % 24.100us         1  24.100us  24.100us  24.100us  cuDeviceTotalMem
0.00 % 17.100us       101     169ns     100ns     900ns  cuDeviceGetAttribute
0.00 % 9.0000us         3  3.0000us     300ns  8.0000us  cuDeviceGetCount
0.00 % 3.2000us         1  3.2000us  3.2000us  3.2000us  cuDeviceGetName
0.00 % 1.5000us         2     750ns     200ns  1.3000us  cuDeviceGet
0.00 % 300ns         1     300ns     300ns     300ns  cuDeviceGetUuid
0.00 % 300ns         1     300ns     300ns     300ns  cuDeviceGetLuid

My Code:

#include <iostream>
#include <math.h>
#include <stdlib.h>

// Kernel to add the elements of two arrays (y[i] = x[i] + y[i]), grid-stride loop
__global__ void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int RunManagedVersion()
{
    int N = 1 << 20;
    float* x, * y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));

    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

int RunUnmanagedVersion()
{
    int N = 1 << 20;

    //Allocate host memory for input and output arrays
    float* x = (float*)calloc(N, sizeof(float));
    float* y = (float*)calloc(N, sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    //Allocate device memory for input and output arrays
    float* d_pX = 0;
    float* d_pY = 0;
    cudaMalloc(&d_pX, N * sizeof(float));
    cudaMalloc(&d_pY, N * sizeof(float));

    //Copy INPUT ARRAY data from host to device
    cudaMemcpy(d_pX, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pY, y, N * sizeof(float), cudaMemcpyHostToDevice);


    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, d_pX, d_pY);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    //Copy Results - Device to Host
    cudaMemcpy(y, d_pY, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));

    std::cout << "Max error: " << maxError << std::endl;

    // device memory free
    cudaFree(d_pX);
    cudaFree(d_pY);

    //host memory free
    free(x);
    free(y);

    return 0;
}

int main()
{
    RunUnmanagedVersion();

    //RunManagedVersion();

    return 0;
}

QUESTION: I have run the above code several times and noted that the DeviceToHost data transfer time is significantly higher for the managed version (i.e. Unified Memory). Is this normal (and why?), or am I doing something wrong in the code?

Upvotes: 0

Views: 280

Answers (1)

Robert Crovella

Reputation: 151934

I believe the things I'm going to say here are:

  1. undocumented, therefore subject to change, but
  2. observable with careful profiling

Furthermore, all of these comments are specific to the UM regime associated with either Windows UM usage, or Linux usage with a pre-Pascal GPU.

In this pre-Pascal and/or Windows UM regime, transfer of data from host to device is initiated at the point of kernel launch. This manifests itself as latency in the kernel launch process (from the time the kernel launch is requested until the time the kernel code actually begins executing).
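
As a rough illustration (a sketch against the posted RunManagedVersion(), not code from the question), you can observe this launch latency by timing the launch request and the subsequent synchronization separately with a host-side clock:

    // Sketch: separate the cost of the launch request from launch-to-completion.
    // Assumes the managed allocations x, y and the launch configuration from the
    // posted code; requires #include <chrono> and <cstdio>.
    auto t0 = std::chrono::high_resolution_clock::now();
    add<<<numBlocks, blockSize>>>(N, x, y);   // in this UM regime, H->D migration begins here
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();                  // waits for the migration and the kernel to finish
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("launch call: %.3f ms, launch-to-completion: %.3f ms\n",
        std::chrono::duration<double, std::milli>(t1 - t0).count(),
        std::chrono::duration<double, std::milli>(t2 - t1).count());

Comparing these two numbers between the managed and unmanaged versions should show the extra latency absorbed by the launch/synchronize path in the UM case.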

In this scenario, the UM system transfers the data in fixed size chunks. This is evident from your profiler output:

Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
64  128.00KB  128.00KB  128.00KB  8.000000MB  3.279000ms  Host To Device

Since there is indeed 8MB of data that needs to be transferred to the device, and in the non-UM case that transfer seems to happen in ~1.5ms:

GPU activities : 56.30 % 1.5032ms         2  751.60us  751.44us  751.76us  [CUDA memcpy HtoD]

we would conclude that even the H->D case under UM on Windows is somewhat less performant than the non-UM case (roughly 2.6GB/s effective versus ~5.6GB/s). I attribute this to the need (for whatever reason) to transfer the 8MB of data in relatively small 128KB chunks. In addition, WDDM has direct control over this GPU in the Windows case, and CUDA is actually a "client" of WDDM for these activities, especially those related to memory. It's entirely possible that WDDM decided it wanted to do something with/to the GPU during the data transfer, and may have inserted some gaps or inefficiency.

In the D->H case on Windows, the situation seems to be different, and arguably worse. However, we have to be careful in assessing what is going on here. The first question might be:

Why is 12MB of data being transferred D->H?

There appear to be a few things to note:

  • A UM allocation seems to appear first in device memory. (This is different from the demand-paged case!) This means that if the first thing you do is access that UM allocation in host code, then the allocation has to be transferred from device to host. This accounts for 8MB of the 12MB of transfer (the host-side initialization of x and y), and if you do careful profiling experiments, you can convince yourself of this.
  • A UM allocation on the device seems to transfer to the host based on host code activity. This is self-evident if the first bullet above is considered carefully. But even if we only pay attention to the data transfer after the kernel activity, it's easy, via profiler experimentation, to convince ourselves that if no host code after the kernel launch actually accesses the data, no transfers will occur there (see the sketch after this list).
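
For illustration, here is a rough sketch of such an experiment against the posted RunManagedVersion() (the VARIANT switch is just a placeholder for building two separate test cases; it is not code from the question):

    // Sketch of a profiling experiment using the posted managed version.
    // Build/run each variant under nvprof and compare the "Device To Host" rows.
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

#if VARIANT == 1
    // Variant 1: the host reads only y (4MB) after the kernel, as in the posted code.
    // Expectation: ~4MB of additional D->H traffic on top of the ~8MB caused by the
    // host-side initialization of x and y, for the 12MB total seen in the question.
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;
#else
    // Variant 0: no host access to x or y after the kernel.
    // Expectation: the post-kernel D->H transfers disappear; only the ~8MB from the
    // initial host writes to x and y remain.
#endif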

The second bullet above means that we could presume that D->H transfers might be:

  1. "smeared out" over the duration of the host code that is actually somehow causing these transfers
  2. somehow happening "concurrently" with the host code.

We could also conclude that only about 1/3 of the reported UM D->H activity (the ~4MB read back by the error-check loop, out of 12MB total) actually occurs after the kernel call, and so we might choose to compare only that portion to the D->H report from the non-UM case.

The net of all this is that I don't think it's a trivial matter to compare the two cases simply by looking at the kind of data I have excerpted above. Yes, the UM case probably does perform worse than the non-UM case. Nothing in the CUDA documentation states that the two are expected to be performance-identical. No, you are not doing anything "wrong".

FWIW, the Maxwell/Kepler UM case on Linux looks much better than it does on Windows WDDM, so I think WDDM is probably also involved in the less efficient behavior.

Upvotes: 1
