Reputation: 10222
I'm writing a function that does a lot of BLAS gemv operations.
I would like to be able to do this on the GPU, and I've tried using cuBLAS.
My problem is that my matrices and vectors are rather small (a 100x100 matrix and a 100-element vector). cuBLAS takes ages compared to the CPU, and I can see why: a mixture of fast cache on the CPU and a large overhead in making calls to the GPU.
Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.
That is, the time it takes CUDA to set up the call and send it to the graphics processor -- not counting the time it actually takes to do the matrix-vector multiplication.
How would I go about doing this?
Upvotes: 4
Views: 4725
Reputation: 21521
Update: The following results are for a hand-written FFT GPU algorithm on 2005-era hardware (nVidia 7800 GTX), but they illustrate the principle of CPU-GPU transfer bottlenecks.
The overhead is not the call per se but the compilation of the GPU program and the transfer of data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.
N        FFTw (ms)   GPUFFT (ms)   GPUFFT MFLOPS   GPUFFT Speedup
8        0           0.06          3.352705        0.006881
16       0.001       0.065         7.882117        0.010217
32       0.001       0.075         17.10887        0.014695
64       0.002       0.085         36.080118       0.026744
128      0.004       0.093         76.724324       0.040122
256      0.007       0.107         153.739856      0.066754
512      0.015       0.115         320.200892      0.134614
1024     0.034       0.125         657.735381      0.270512
2048     0.076       0.156         1155.151507     0.484331
4096     0.173       0.215         1834.212989     0.804558
8192     0.483       0.32          2664.042421     1.510011
16384    1.363       0.605         3035.4551       2.255411
32768    3.168       1.14          3450.455808     2.780041
65536    8.694       2.464         3404.628083     3.528726
131072   15.363      5.027         3545.850483     3.05604
262144   33.223      12.513        3016.885246     2.655183
524288   72.918      25.879        3079.443664     2.817667
1048576  173.043     76.537        2192.056517     2.260904
2097152  331.553     157.427       2238.01491      2.106081
4194304  801.544     430.518       1715.573229     1.861814
The table above shows timings of a GPU FFT implementation versus a CPU implementation as a function of kernel size. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, for some implementations/sizes entirely in cache. This makes the CPU the best choice for small operations.
If, on the other hand, you need to perform large batches of work on data with minimal movement to/from the GPU, then the GPU will beat the CPU hands down.
As for measuring the effect in your example, I would suggest performing an experiment like the one above. Work out the FLOPS for each matrix size and run the test on both the CPU and the GPU for varying sizes. Output the size, time, and FLOPS for GPU vs CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code, time the whole thing, and then divide the total time by the number of iterations to get the per-loop time. Also try differently shaped matrices if your algorithm allows it (e.g. 10x100 rather than 100x10).
Using this data you can get a feel for what the overheads are. To pin them down exactly, repeat the same experiment but replace the inner shader code executed on the GPU with a no-operation (simply copy from input to output).
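For illustration, here is a minimal sketch (not from the original answer) of the GPU half of such an experiment, using cuBLAS single-precision gemv as the GPU routine under test. The size range, iteration count, and FLOP estimate are placeholders to adapt, error checking is omitted, and the CPU side would be the analogous loop around your CPU gemv:
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>
int main()
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const int iterations = 500;                 // several hundred, as suggested above
    const float alpha = 1.0f, beta = 0.0f;
    std::printf("n,ms_per_gemv,mflops\n");      // CSV header
    for (int n = 16; n <= 1024; n *= 2) {
        // Allocate and initialise device data once per size (this cost is not timed).
        float *dA, *dx, *dy;
        cudaMalloc(&dA, n * n * sizeof(float));
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        std::vector<float> host(n * n, 1.0f);
        cudaMemcpy(dA, host.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dx, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < iterations; ++i)
            cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);
        cudaDeviceSynchronize();                // gemv launches are asynchronous
        auto t1 = std::chrono::high_resolution_clock::now();
        double totalMs   = std::chrono::duration<double, std::milli>(t1 - t0).count();
        double msPerCall = totalMs / iterations;               // divide total time by iterations
        double mflops    = (2.0 * n * n) / (msPerCall * 1e3);  // roughly 2*n*n flops per gemv
        std::printf("%d,%.6f,%.3f\n", n, msPerCall, mflops);
        cudaFree(dA); cudaFree(dx); cudaFree(dy);
    }
    cublasDestroy(handle);
    return 0;
}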
Hope this helps,
Upvotes: 8
Reputation: 11
To find the call overhead, call a CUDA kernel that does as little as possible.
#include <sys/time.h>                 // for gettimeofday (POSIX)
__global__ void minimalKernel() {}    // a kernel that does as little as possible
const int NLoops = 1000000;           // see below for choosing a good value
struct timeval cpuStart, cpuEnd;
for (int i = 0; i < NLoops; i++) {
    gettimeofday(&cpuStart, 0);       // get start time on CPU
    minimalKernel<<<1, 1>>>();        // call minimal CUDA kernel
    cudaDeviceSynchronize();          // kernel launches are asynchronous; wait for completion
    gettimeofday(&cpuEnd, 0);         // get end time on CPU
    // save elapsed time, e.g. accumulate (cpuEnd - cpuStart) in microseconds
}
Follow the code in Alex P.'s answer.
The less processing you do in the kernel, the closer the measured time difference will be to the call overhead alone.
Do a little experimenting to find a good value for NLoops (maybe 1,000,000). Be sure that the elapsed time is longer than the resolution of your timer, or you'll end up with all zeros. If that happens, write some kernel code that executes in a time interval you can predict (n loops of x cycles each).
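For example, here is a minimal sketch of such a predictable kernel (not from the original answer; the name spinKernel and the cycle count are placeholders, and clock64() is the device-side cycle counter):
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();        // read the per-SM cycle counter
    while (clock64() - start < cycles)
        ;                               // busy-wait for roughly `cycles` GPU cycles
}
// Example launch: spinKernel<<<1, 1>>>(100000);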
It's hard to remove all the non-CUDA computations that might occur between cpuStart and cpuEnd (like interrupt processing), but making several runs and averaging can give good results.
Upvotes: 1
Reputation: 770
Are your matrices already on the GPU? If not, CUBLAS might transfer them for you (known as thunking), which is an additional overhead.
Also, GPUs do not really shine for such small computations; the GPU will probably be slower than the CPU here, since you also have to transfer the result back. If you can, use bigger matrices. Otherwise you might want to use streams (cudaStream_t) to start multiple computations on the GPU in parallel.
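As an illustrative sketch (not part of the original answer): the following uploads the data once so it stays resident on the GPU, then issues one single-precision gemv per stream so the calls can overlap. The matrix size, stream count, and the omitted error checking are placeholders to adapt:
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
int main()
{
    const int n = 100, nStreams = 4;
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t handle;
    cublasCreate(&handle);
    std::vector<float> hostA(n * n, 1.0f), hostX(n, 1.0f);
    float *dA[nStreams], *dX[nStreams], *dY[nStreams];
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&dA[i], n * n * sizeof(float));
        cudaMalloc(&dX[i], n * sizeof(float));
        cudaMalloc(&dY[i], n * sizeof(float));
        // Upload once; later gemv calls reuse the resident device copies.
        cudaMemcpy(dA[i], hostA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dX[i], hostX.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    }
    // Issue one gemv per stream; the calls return immediately and may overlap on the GPU.
    for (int i = 0; i < nStreams; ++i) {
        cublasSetStream(handle, streams[i]);
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA[i], n, dX[i], 1, &beta, dY[i], 1);
    }
    cudaDeviceSynchronize();              // wait for all streams to finish
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(dA[i]); cudaFree(dX[i]); cudaFree(dY[i]);
    }
    cublasDestroy(handle);
    return 0;
}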
If you want to measure the execution time of a kernel in CUDA, you need to enclose it (or anything else that computes on the GPU) between events, like this when using the CUDA runtime API:
cudaEvent_t start, stop;
cudaEventCreate(&start);   // events must be created before they can be recorded
cudaEventCreate(&stop);
cudaEventRecord(start);    // note: cudaEventRecord takes the event by value
struct timeval cpuStart, cpuEnd;
gettimeofday(&cpuStart, 0); // get start time on CPU
// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...
gettimeofday(&cpuEnd, 0); // get end time on CPU
double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds
cudaEventRecord(stop);
// Wait until the stop event occurred
cudaError_t eventResult;
do
{
eventResult = cudaEventQuery(stop);
}
while (eventResult == cudaErrorNotReady);
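// Alternative: cudaEventSynchronize(stop) blocks until the event has completed, avoiding this busy-wait.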
// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult); // requires #include <assert.h> or <cassert>
// Retrieve the time
float gpuDuration = 0.0; // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);
// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);
You might want to check the error code of every call to CUDA (at least with an assert), as you may get errors from previous calls, resulting in hours of debugging...
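For example, a minimal sketch of such a check wrapped in a macro (the macro name CUDA_CHECK is arbitrary):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)
// Usage: CUDA_CHECK(cudaEventRecord(start));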
(Note: I mostly use the CUDA driver API, so this might not work out of the box. Sorry for that.)
EDIT: Just saw that you want to measure the call itself, not the duration of the kernel. You can do that by simply measuring the time on the CPU for the call - see the updated code above. This works only on Linux because gettimeofday is not available for Windows (AFAIK).
Upvotes: 1
Reputation: 5087
You can get the time in nanoseconds at which an event was queued, submitted, started, and finished on the device by calling the OpenCL function clGetEventProfilingInfo on your buffer-transfer event.
More info, and how to set it up, is here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
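For illustration, a minimal sketch using the OpenCL C API (not from the original answer; the function name printEventTimes and the event argument are placeholders, and the command queue must have been created with CL_QUEUE_PROFILING_ENABLE):
#include <CL/cl.h>
#include <stdio.h>
static void printEventTimes(cl_event evt)
{
    cl_ulong queued, submitted, started, ended;
    clWaitForEvents(1, &evt);   /* the command must have finished before querying END */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued,    NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitted, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &started,   NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &ended,     NULL);
    printf("queued->submit: %llu ns, submit->start: %llu ns, start->end: %llu ns\n",
           (unsigned long long)(submitted - queued),
           (unsigned long long)(started - submitted),
           (unsigned long long)(ended - started));
}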
I think that for 100x100 matrices you may be better off sticking to the CPU for the crunching. Unless you have many matrices to multiply at the same time, the benefit of the GPU will be hardly noticeable due to the (small) transfer overhead and the usually much lower clock speeds. Make sure you tweak your kernel to use as much local data as possible - on my hardware there is 32 KB per work group, which is not enough for a full 100x100 matrix of floats (about 40 KB) but comfortably holds the vectors and large tiles of the matrix. The built-in dot product functions should also be very handy.
There was an awesome talk about this at AFDS last year (see sessionId 2908): http://developer.amd.com/afds/pages/OLD/sessions.aspx They talk in detail about optimizing the kernel and hard-coding the optimal sizes.
Upvotes: 1