Reputation: 161
I need to measure the time difference between allocating normal CPU memory with new and a call to cudaMallocManaged. We are working with unified memory and are trying to figure out the trade-offs of switching things to cudaMallocManaged. (The kernels seem to run a lot slower, likely due to a lack of caching or something.)
Anyway, I am not sure of the best way to time these allocations. Would one of boost's process_real_cpu_clock, process_user_cpu_clock, or process_system_cpu_clock give me the best results? Or should I just use the regular system time call in C++11? Or should I use the cudaEvent stuff for timing?
I figure that I shouldn't use the cuda events, because they are for timing GPU processes and would not be accurate for timing CPU calls (correct me if I am wrong there). If I could use the cudaEvents on just the cudaMallocManaged call, what would be the most accurate thing to compare against when timing the new call? I just don't know enough about memory allocation and timing. Everything I read seems to just make me more confused due to boost's and nvidia's shoddy documentation.
Upvotes: 1
Views: 1126
Reputation: 3127
You can use CUDA events to measure the time of functions executed on the host.
cudaEventElapsedTime computes the elapsed time between two events (in milliseconds, with a resolution of around 0.5 microseconds).
Read more at: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html
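For instance, a minimal sketch of timing a cudaMallocManaged call this way might look like the following (error checking is omitted and the buffer size is an arbitrary placeholder):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;                 // placeholder allocation size

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    int* buf = nullptr;
    cudaMallocManaged(&buf, N * sizeof(int)); // the call being timed
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);               // block until the stop event has completed
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    std::printf("cudaMallocManaged: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buf);
    return 0;
}
Note that cudaEventSynchronize(stop) must run before cudaEventElapsedTime, since the elapsed time is only valid once both events have completed.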
In addition, if you are also interested in timing your kernel execution time, the CUDA event API handles the asynchrony for you: cudaEventSynchronize blocks the host until the event, and any asynchronous work recorded before it (such as a kernel launch), has completed.
In any case, you should use the same timing mechanism for both measurements (always CUDA events, or Boost, or your own timer) so that resolution and overhead are comparable.
The profiler nvprof shipped with the CUDA toolkit may help to understand and optimize the performance of your CUDA application.
Read more at: http://docs.nvidia.com/cuda/profiler-users-guide/index.html
Upvotes: 3
Reputation: 218700
I recommend:
auto t0 = std::chrono::high_resolution_clock::now();
// what you want to measure
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";
This will output the difference in seconds represented as a double.
Allocation algorithms usually optimize themselves as they go along. That is, the first allocation is often more expensive than the second because caches of memory are created during the first in anticipation of the second. So you may want to put the thing you're timing in a loop, and average the results.
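A minimal sketch of that loop-and-average idea, assuming an arbitrary iteration count and allocation size, could look like this:
#include <chrono>
#include <cstddef>
#include <iostream>

int main()
{
    constexpr int iterations = 1000;          // arbitrary placeholder count
    constexpr std::size_t bytes = 1 << 20;    // arbitrary placeholder size

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        char* p = new char[bytes];            // the allocation being measured
        delete[] p;
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration<double>(t1 - t0).count() / iterations
              << "s per allocation/deallocation\n";
    return 0;
}
Note that this times the new/delete pair, and an aggressive optimizer may elide an allocation whose memory is never touched, so in practice you may want to write to the buffer, or store the pointers and free them after t1.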
Some implementations of std::chrono::high_resolution_clock have been less than spectacular, but they are improving with time. You can assess your implementation with:
auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";
That is, how fast can your implementation get the current time? If it is slow, two successive calls will demonstrate a large time in-between. On my system (at -O3) this outputs on the order of:
1.2e-07s
which means I can time something that takes on the order of 1 microsecond. To get a finer measurement than that I have to loop over many operations, and divide by the number of operations, subtracting out the loop overhead if that would be significant.
If your implementation of std::chrono::high_resolution_clock appears to be unsatisfactory, you may be able to build your own chrono clock along the lines of this. The disadvantage is obviously a bit of non-portable work. However, you get the std::chrono duration and time_point infrastructure for free (time arithmetic and units conversion).
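As a rough, non-portable sketch (not the example referred to above), a chrono-compatible clock on a POSIX system could wrap clock_gettime(CLOCK_MONOTONIC) like this:
#include <chrono>
#include <ratio>
#include <time.h>

// Minimal chrono-compatible clock built on the POSIX monotonic clock.
// Shown only as a sketch of the interface a custom clock must provide.
struct raw_monotonic_clock
{
    using rep        = long long;
    using period     = std::nano;
    using duration   = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<raw_monotonic_clock>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept
    {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return time_point(duration(static_cast<rep>(ts.tv_sec) * 1000000000LL
                                   + ts.tv_nsec));
    }
};
Such a clock drops straight into the snippets above: raw_monotonic_clock::now() yields a time_point, and the difference of two time_points is a duration that std::chrono::duration<double> converts to seconds.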
Upvotes: 1