Reputation: 4903
How does CUDA execute the cudaMemset() function?
I've observed a considerable time saving when memory initialization is implemented as a kernel launched with number of threads = number of elements. Why is such a saving achieved?
Upvotes: 3
Views: 1863
Reputation: 11529
cudaMemset calls cuMemsetD8 or cuMemsetD8Async. This is easy to confirm in the tools. The driver implementation tries to optimize execution based on the alignment of the destination address, the size of the value to write, and the number of bytes to write. This is easy to observe by writing a few benchmarks. The CUDA implementation has to handle all cases (8-bit alignment, tails, ...). If you have a very specific case (32-bit aligned, size divisible by 4), you should be able to write a kernel that beats the driver implementation in terms of CPU overhead. The GPU execution time is likely to be similar.
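A benchmark of the kind mentioned above could look roughly like the sketch below. `MemsetTiming` and `time_memset` are hypothetical names, not part of CUDA. The sketch assumes `dst` is ordinary device memory from `cudaMalloc`, where the memset is normally asynchronous with respect to the host, so the host-side figure approximates submission overhead rather than execution time.

```cuda
#include <chrono>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical result type for this sketch (not a CUDA type).
struct MemsetTiming {
    double      cpu_us_per_call;  // host time spent issuing each cudaMemset
    float       gpu_us_per_call;  // event-to-event device time, averaged per call
    cudaError_t status;           // last error seen after the timed loop
};

// Hypothetical helper: issues `reps` cudaMemset calls on `dst` and reports
// average host-side and device-side time per call.
MemsetTiming time_memset(void* dst, int value, size_t bytes, int reps) {
    MemsetTiming out{0.0, 0.0f, cudaSuccess};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time initialization cost is not measured.
    cudaMemset(dst, value, bytes);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        cudaMemset(dst, value, bytes);
    }
    auto t1 = std::chrono::steady_clock::now();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed memsets have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    out.cpu_us_per_call =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
    // Includes any idle gaps between consecutive memsets on the device.
    out.gpu_us_per_call = ms * 1000.0f / reps;
    out.status = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return out;
}
```

Calling it with an aligned pointer and a round size, then with something like `buf + 1` and a size that leaves a tail (for example 4099 bytes), gives a rough picture of how much alignment and tails matter on a given device and driver.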
In terms of efficiently writing memory you need to consider several device limits.
The simple mapping of one thread per element (be it 8-bit or 128-bit) is easy to implement, and the conditional check needed when the size is not a multiple of WARP_SIZE is fairly easy to handle.
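As a minimal sketch of that one-thread-per-element mapping: `fill_kernel` and `fill_device` are hypothetical names, not CUDA APIs, and the block size of 256 is an arbitrary choice. The wrapper assumes `dst` is a device pointer aligned to `sizeof(T)` and that the element count fits in a one-dimensional grid.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// One thread writes one element. T can be anything from an 8-bit type
// (uint8_t) up to a 128-bit type (uint4).
template <typename T>
__global__ void fill_kernel(T* dst, T value, size_t count) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    // Bounds check: threads past the end of the buffer in the last block
    // do nothing, so count need not be a multiple of the block or warp size.
    if (i < count) {
        dst[i] = value;
    }
}

// Hypothetical host wrapper for this sketch.
// Assumption: dst is device memory aligned to sizeof(T).
template <typename T>
cudaError_t fill_device(T* dst, T value, size_t count,
                        cudaStream_t stream = 0) {
    if (count == 0) {
        return cudaSuccess;  // a zero-block launch would be an error
    }
    assert(reinterpret_cast<uintptr_t>(dst) % sizeof(T) == 0);

    const unsigned threads = 256;  // arbitrary block size for this sketch
    const unsigned blocks =
        static_cast<unsigned>((count + threads - 1) / threads);  // round up
    fill_kernel<T><<<blocks, threads, 0, stream>>>(dst, value, count);
    return cudaGetLastError();  // reports launch-configuration errors
}
```

For the 32-bit aligned case, `fill_device<uint32_t>(ptr, 0u, n)` plays the role of a memset over `n` words. The launch is asynchronous, so the host pays only for the kernel launch.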
Upvotes: 2