Reputation: 5659
I am trying to compare the total execution time of the managed and un-managed versions of CUDA memory management.
In the following sample code I have two functions that do exactly the same thing. The only difference is in their memory management: one function uses cudaMalloc()/cudaMemcpy(), while the other uses only cudaMallocManaged().
I used nvprof to measure the execution times and obtained the following output:
Managed version nvprof output:
==29028== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities : 100.00 % 59.425us 1 59.425us 59.425us 59.425us add(int, float*, float*)
API calls : 78.08 % 296.49ms 2 148.24ms 1.7127ms 294.78ms cudaMallocManaged
19.61 % 74.451ms 1 74.451ms 74.451ms 74.451ms cuDevicePrimaryCtxRelease
1.55 % 5.8705ms 1 5.8705ms 5.8705ms 5.8705ms cudaLaunchKernel
0.67 % 2.5547ms 2 1.2774ms 974.40us 1.5803ms cudaFree
0.07 % 280.60us 1 280.60us 280.60us 280.60us cudaDeviceSynchronize
0.01 % 28.300us 3 9.4330us 3.0000us 13.300us cuModuleUnload
0.01 % 26.800us 1 26.800us 26.800us 26.800us cuDeviceTotalMem
0.00 % 17.700us 101 175ns 100ns 900ns cuDeviceGetAttribute
0.00 % 10.100us 3 3.3660us 300ns 8.8000us cuDeviceGetCount
0.00 % 3.2000us 1 3.2000us 3.2000us 3.2000us cuDeviceGetName
0.00 % 3.0000us 2 1.5000us 300ns 2.7000us cuDeviceGet
0.00 % 500ns 1 500ns 500ns 500ns cuDeviceGetLuid
0.00 % 200ns 1 200ns 200ns 200ns cuDeviceGetUuid
==29028== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
64 128.00KB 128.00KB 128.00KB 8.000000MB 3.279000ms Host To Device
146 84.164KB 32.000KB 1.0000MB 12.00000MB 64.50870ms Device To Host
Un-managed version nvprof output:
==23864== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities : 56.30 % 1.5032ms 2 751.60us 751.44us 751.76us [CUDA memcpy HtoD]
41.48 % 1.1075ms 1 1.1075ms 1.1075ms 1.1075ms [CUDA memcpy DtoH]
2.23 % 59.457us 1 59.457us 59.457us 59.457us add(int, float*, float*)
API calls : 78.92 % 270.08ms 2 135.04ms 656.40us 269.43ms cudaMalloc
19.79 % 67.730ms 1 67.730ms 67.730ms 67.730ms cuDevicePrimaryCtxRelease
1.05 % 3.5796ms 3 1.1932ms 1.0106ms 1.4341ms cudaMemcpy
0.10 % 346.20us 2 173.10us 3.4000us 342.80us cudaFree
0.09 % 314.30us 1 314.30us 314.30us 314.30us cudaDeviceSynchronize
0.02 % 74.200us 1 74.200us 74.200us 74.200us cudaLaunchKernel
0.01 % 34.700us 3 11.566us 2.5000us 29.100us cuModuleUnload
0.01 % 24.100us 1 24.100us 24.100us 24.100us cuDeviceTotalMem
0.00 % 17.100us 101 169ns 100ns 900ns cuDeviceGetAttribute
0.00 % 9.0000us 3 3.0000us 300ns 8.0000us cuDeviceGetCount
0.00 % 3.2000us 1 3.2000us 3.2000us 3.2000us cuDeviceGetName
0.00 % 1.5000us 2 750ns 200ns 1.3000us cuDeviceGet
0.00 % 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
0.00 % 300ns 1 300ns 300ns 300ns cuDeviceGetLuid
My Code:
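The add kernel is not reproduced here; judging from the add(int, float*, float*) signature in the nvprof output, it is a plain element-wise addition. A minimal grid-stride version, together with the headers both functions rely on, is assumed to look like this:

#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

// Assumed kernel body (only its signature is visible in the profiles above):
// element-wise y[i] = x[i] + y[i] over n floats, using a grid-stride loop.
__global__ void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}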
int RunManagedVersion()
{
int N = 1 << 20;
float* x, * y;
// Allocate Unified Memory -- accessible from CPU or GPU
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
int RunUnmanagedVersion()
{
int N = 1 << 20;
//Declare pointers for input and output arrays
float* x = (float*)calloc(N, sizeof(float));
float* y = (float*)calloc(N, sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
//Allocate device memory for input and output images
float* d_pX = 0;
float* d_pY = 0;
cudaMalloc(&d_pX, N * sizeof(float));
cudaMalloc(&d_pY, N * sizeof(float));
//Copy INPUT ARRAY data from host to device
cudaMemcpy(d_pX, x, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_pY, y, N * sizeof(float), cudaMemcpyHostToDevice);
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, d_pX, d_pY);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
//Copy Results - Device to Host
cudaMemcpy(y, d_pY, N * sizeof(float), cudaMemcpyDeviceToHost);
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
// device memory free
cudaFree(d_pX);
cudaFree(d_pY);
//host memory free
free(x);
free(y);
return 0;
}
int main()
{
RunUnmanagedVersion();
//RunManagedVersion();
return 0;
}
QUESTION: I have run the above code several times and noted that the DeviceToHost data transfer time is significantly higher in the managed version (i.e. unified memory). Is this normal (and why?), or am I doing something wrong in the code?
Upvotes: 0
Views: 280
Reputation: 151934
I believe the comments I'm going to make here apply specifically to the UM regime associated with either Windows UM usage, or Linux usage with a pre-Pascal GPU.
In this pre-Pascal and/or Windows UM regime, transfer of data from host to device is initiated at the point of kernel launch. This manifests itself as latency in the kernel launch process (from the time the kernel launch was requested until the time the kernel code actually begins executing).
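This is consistent with the API-call sections of your two profiles: cudaLaunchKernel shows up as ~5.87ms in the managed run but only ~74us in the non-managed run. If you want to observe it directly, a rough sketch (using std::chrono around your existing launch; the exact split you see may vary) is to time the launch and the synchronize separately in RunManagedVersion:

#include <chrono>   // add at the top of the file

// replace the launch + synchronize lines in RunManagedVersion with:
auto t0 = std::chrono::high_resolution_clock::now();
add<<<numBlocks, blockSize>>>(N, x, y);   // in this UM regime, H->D migration is initiated by the launch
auto t1 = std::chrono::high_resolution_clock::now();
cudaDeviceSynchronize();                  // wait for migration + kernel to finish
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "launch:      "
          << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() << " us\n";
std::cout << "sync+kernel: "
          << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << " us\n";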
In this scenario, the UM system transfers the data in fixed size chunks. This is evident from your profiler output:
Count Avg Size Min Size Max Size Total Size Total Time Name
64 128.00KB 128.00KB 128.00KB 8.000000MB 3.279000ms Host To Device
Since there is indeed 8MB of data that needs to be transferred to the device, and in the non-UM case that transfer seems to happen in ~1.5ms:
GPU activities : 56.30 % 1.5032ms 2 751.60us 751.44us 751.76us[CUDA memcpy HtoD]
we would conclude that even the H->D case is somewhat less performant in the UM case on Windows as compared to the non-UM case. I attribute this to the need (for whatever reason) to transfer the 8MB of data in relatively small 128KB chunks. In addition, WDDM has direct control over this GPU in the Windows case, and CUDA is effectively a "client" of WDDM for these activities, especially those related to memory. It's entirely possible that WDDM decided it wanted to do something with/to the GPU during the data transfer, and may have introduced some gaps or inefficiency.
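To put numbers on that: 2 arrays x 2^20 floats x 4 bytes = 8MB, which at 128KB per chunk is exactly the 64 host-to-device transfers the UM profiler reports. The effective H->D rate works out to roughly 8MB / 3.279ms ≈ 2.4GB/s for the UM case versus 8MB / 1.503ms ≈ 5.2GB/s for the two explicit cudaMemcpy calls.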
In the D->H case, on Windows, the situation seems to be different, and arguably worse. However, we have to be careful to assess what is going on here. The first question might be:
Why is 12MB of data being transferred D->H?
There appear to be a few things to note: in this regime the managed allocations are initially resident on the device, so the host-side initialization loop that writes x and y forces those pages to migrate back to the host before the kernel ever runs, and after the kernel the error-checking loop touches y, which triggers a further migration. The second point means that we could presume that only the post-kernel transfers of y are the analogue of the explicit cudaMemcpy DtoH in the non-UM version. We could therefore conclude that only 1/3 of the reported UM D->H activity is actually occurring after the kernel call, and so we might choose to compare only that portion to the D->H report from the non-UM case.
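In numbers (under the same assumptions about this regime): x and y together are 2 x 4MB = 8MB migrated D->H during host initialization, and the verification loop then pulls back y, another 4MB; 8MB + 4MB = 12MB, matching the profiler total, and 4MB / 12MB is the 1/3 above. Only that final 4MB is directly comparable to the single ~1.1ms cudaMemcpy DtoH in the non-UM profile.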
The net of all this is that I don't think it's a trivial matter to compare the two cases simply by looking at the kind of data I have excerpted above. Yes, the UM case probably does perform worse than the non-UM case. There is nowhere in the CUDA documentation that states these are expected to be identical in performance. And no, you are not doing anything "wrong".
FWIW, the Maxwell/Kepler UM case on Linux looks much better than it does under Windows WDDM, so I think WDDM is probably involved in the less efficient behavior as well.
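As an aside, if you want to confirm that the GPU is indeed under WDDM, one way (a small sketch, not part of the code above) is to query the tccDriver field of cudaDeviceProp; GeForce parts such as the GTX 1070 will report WDDM on Windows:

#include <iostream>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // tccDriver is 1 when the device runs under the TCC driver model, 0 under WDDM.
    std::cout << prop.name << " driver model: "
              << (prop.tccDriver ? "TCC" : "WDDM") << std::endl;
    return 0;
}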
Upvotes: 1