Reputation: 4903

How is cudaMemset implemented?

How does CUDA execute the cudaMemset() function? I've observed considerable time savings when memory initialization is instead done by a kernel launched with number of threads = number of elements. Why is such a saving achieved?

Upvotes: 3

Answers (1)

Greg Smith

Reputation: 11529

cudaMemset calls cuMemsetD8 or cuMemsetD8Async. This is easy to confirm in the profiling tools. The driver implementation tries to optimize execution based on the alignment of the destination address, the size of the value to write, and the number of bytes to write. This is easy to verify by writing a few benchmarks. The CUDA implementation has to handle all cases (8-bit alignment, tails, ...). If you have a very specific case (32-bit aligned destination, byte count divisible by 4), you should be able to write a kernel that beats the driver implementation in terms of CPU overhead. The GPU execution time is likely to be similar.
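As a rough illustration of the specialized case described above, here is a minimal sketch of a 32-bit fill kernel checked against cudaMemset. The names `fill32` and `fill32_matches_memset` are made up for this example, and the sketch assumes a 4-byte-aligned destination, a byte count divisible by 4, and `n_words > 0`. It checks equivalence only; it does not measure performance.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical specialized fill: assumes dst is 4-byte aligned and the
// length is given in 32-bit words (byte count divisible by 4).
__global__ void fill32(uint32_t* dst, uint32_t value, size_t n_words)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n_words)          // guard for the last, partially used block
        dst[i] = value;
}

// Fills one buffer with cudaMemset and a second one with fill32, then
// compares them on the host. Returns true if the bytes are identical.
// Assumes n_words > 0.
bool fill32_matches_memset(size_t n_words, unsigned char byte)
{
    const size_t bytes = n_words * sizeof(uint32_t);
    uint32_t* d_a = nullptr;
    uint32_t* d_b = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return false; }

    // Reference: byte-wise fill through the runtime API.
    cudaMemset(d_a, byte, bytes);

    // Specialized path: replicate the byte into all four lanes of a word.
    const uint32_t word = 0x01010101u * byte;
    const int threads = 256;
    const unsigned blocks =
        static_cast<unsigned>((n_words + threads - 1) / threads);
    fill32<<<blocks, threads>>>(d_b, word, n_words);

    // cudaMemcpy on the default stream waits for the preceding work.
    std::vector<uint32_t> h_a(n_words), h_b(n_words);
    bool ok =
        cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
        cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && std::memcmp(h_a.data(), h_b.data(), bytes) == 0;

    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}
```

Calling `fill32_matches_memset(1000, 0xAB)` exercises the bounds check, since 1000 is not a multiple of the block size. To compare CPU overhead, you would time the host side of both calls in your own benchmark.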

To write memory efficiently, you need to consider several device limits.

Each SM can issue 1 LSU (load/store unit) instruction per cycle. To achieve this rate you need 2 warps on Fermi and 4 warps on Kepler.
Each SM can perform one write to L2 per cycle.

The simple mapping of 1 thread per element (be it 8-bit or 128-bit) is easy to implement, and it makes it fairly easy to handle the conditional checks needed when the size is not a multiple of WARP_SIZE.
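The 128-bit variant of that mapping could look like the sketch below, where each thread stores one `uint4` and the first few threads also write the leftover words. The names `fill_words_128` and `fill128_check` are hypothetical. The sketch assumes the buffer comes from cudaMalloc (so it is at least 16-byte aligned) and that `n_words > 0`.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// One thread per 128-bit element. Assumes dst is 16-byte aligned.
// The up to 3 trailing 32-bit words are written by threads 0..2 of block 0.
__global__ void fill_words_128(uint32_t* dst, uint32_t value, size_t n_words)
{
    const size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    const size_t n_vec = n_words / 4;       // number of full 128-bit elements

    if (i < n_vec)                          // bounds check for the last block
        reinterpret_cast<uint4*>(dst)[i] = make_uint4(value, value, value, value);

    const size_t tail_idx = n_vec * 4 + i;  // scalar tail after the vector part
    if (i < 4 && tail_idx < n_words)
        dst[tail_idx] = value;
}

// Launches the kernel and verifies on the host that every word was written.
// Assumes n_words > 0.
bool fill128_check(size_t n_words, uint32_t value)
{
    const size_t bytes = n_words * sizeof(uint32_t);
    uint32_t* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return false;
    cudaMemset(d_buf, 0, bytes);            // start from a known state

    const int threads = 256;
    unsigned blocks =
        static_cast<unsigned>((n_words / 4 + threads - 1) / threads);
    if (blocks == 0) blocks = 1;            // block 0 is still needed for the tail
    fill_words_128<<<blocks, threads>>>(d_buf, value, n_words);

    std::vector<uint32_t> h_buf(n_words);
    bool ok = cudaMemcpy(h_buf.data(), d_buf, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (size_t k = 0; ok && k < n_words; ++k)
        ok = (h_buf[k] == value);

    cudaFree(d_buf);
    return ok;
}
```

For example, `fill128_check(1003, 0xDEADBEEFu)` covers 250 full 128-bit stores plus a 3-word tail. Wider stores mean fewer LSU instructions for the same number of bytes, though whether that helps on a given GPU is something to measure rather than assume.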

Upvotes: 2
