Reputation: 11
The goal is simple: plot the effect of block size on execution time with CUDA. I would expect the execution time to be lowest at each block size that is a multiple of 32, and to increase just after these multiples (e.g. 33, 65, 97, 129, ...). However, this is not the result I'm getting: the execution time simply goes down and then flattens out.
I'm running CUDA runtime 10.0 on an NVIDIA GeForce 940M.
I've tried several ways of measuring the execution time. The one recommended in the CUDA documentation is the following:
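cudaEvent_t gpu_execution_start, gpu_execution_end;
float elapsed_ms = 0.0f;
cudaEventCreate(&gpu_execution_start);
cudaEventCreate(&gpu_execution_end);
// Record an event on the default stream just before and just after the launch.
cudaEventRecord(gpu_execution_start);
kernel<<<n_blocks, blocksize>>>(device_a, device_b, device_out, arraysize);
cudaEventRecord(gpu_execution_end);
// Wait until the end event has actually been reached on the GPU.
cudaEventSynchronize(gpu_execution_end);
// Elapsed time between the two events, in milliseconds.
cudaEventElapsedTime(&elapsed_ms, gpu_execution_start, gpu_execution_end);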
However, this way of timing produces the result described above.
Does the issue lie in how I time the execution, or could this specific GPU be causing the result?
Upvotes: 1
Views: 200
Each of those thread blocks is split into warps of 32 threads, and every time you grow the block past another multiple of 32, the percentage of idle lanes gets smaller. For example, if you launch 33 threads per thread block, each block has one warp with all 32 lanes active and another with only 1 lane active. At 65 threads per block there are two full warps plus one warp with a single active lane, so the same 31 idle lanes are spread over more useful work. So at each increment of your test you are not increasing the amount of wasted lanes; you are just adding one more active warp to that thread block.
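To make the arithmetic concrete, here is a minimal host-side sketch (plain C++, so it also compiles inside a `.cu` file). The helper names `warps_per_block`, `idle_lanes` and `lane_utilization` are made up for illustration and are not part of the CUDA API; the only assumption is a warp size of 32.

```cpp
#include <cassert>

// Warp size on current NVIDIA GPUs (assumed here; query warpSize at runtime to be safe).
constexpr int kWarpSize = 32;

// Number of warps allocated for a block with the given number of threads.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes left idle in the last, partially filled warp of the block.
int idle_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// Fraction of the allocated lanes that belong to real threads.
double lane_utilization(int threads_per_block) {
    return static_cast<double>(threads_per_block) /
           (warps_per_block(threads_per_block) * kWarpSize);
}
```

For block sizes 33, 65 and 97 this gives 2, 3 and 4 warps, always with 31 idle lanes, so the utilization rises from about 0.52 to 0.68 to 0.76. That is consistent with a curve that improves and then flattens rather than one that spikes after every multiple of 32.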
If your problem size is also too small to fill the GPU, all of your work can be scheduled at the same time anyway, so the block size won't have any visible effect on execution time.
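A rough way to check whether a launch is big enough is to estimate how many "waves" of blocks the GPU needs. The sketch below is host-side C++ under stated assumptions: `GpuLimits`, `max_resident_blocks` and `waves` are hypothetical names, the limits would normally come from `cudaGetDeviceProperties`, and the estimate ignores register and shared-memory limits, so treat it as an upper bound rather than an exact occupancy figure.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical summary of the device limits relevant to this estimate.
struct GpuLimits {
    int sm_count;            // number of streaming multiprocessors
    int max_threads_per_sm;  // maximum resident threads per SM
    int max_blocks_per_sm;   // maximum resident blocks per SM
};

// Upper bound on blocks resident on the whole GPU at once.
// Threads are allocated in whole warps of 32, so 33 threads cost 64 slots.
int max_resident_blocks(const GpuLimits& gpu, int threads_per_block) {
    assert(threads_per_block > 0);
    const int warps = (threads_per_block + 31) / 32;
    const int by_threads = gpu.max_threads_per_sm / (warps * 32);
    const int per_sm = std::min(by_threads, gpu.max_blocks_per_sm);
    return per_sm * gpu.sm_count;
}

// Number of waves needed to run n_blocks; 1 means everything fits at once.
int waves(const GpuLimits& gpu, int n_blocks, int threads_per_block) {
    const int resident = max_resident_blocks(gpu, threads_per_block);
    assert(resident > 0);
    return (n_blocks + resident - 1) / resident;
}
```

With illustrative limits of 3 SMs, 2048 threads per SM and 32 blocks per SM (example values, not measured on the 940M), a launch of 16 blocks of 256 threads fits in a single wave. In that regime, changing the block size mostly rearranges work that runs concurrently anyway, so a much larger `arraysize` is needed before block-size effects show up in the timing.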
Hope this helps!
Upvotes: 1