Reputation: 16796
I use Compute Visual Profiler to measure the performance of my CUDA programs.
The profiler output shows two different entries for the cudaMemset function.
What is the difference between the two?
Upvotes: 0
Views: 394
Reputation: 21108
I would guess that the memset128 kernel does the bulk of the work and the memset32_post kernel cleans up the remainder, since you used a size that is not a multiple of 128.
There's nothing to worry about: the runtime is just implementing the memset as efficiently as it can. That said, I'd avoid calling memset in an inner loop (on any processor). If you're really worried about the extra kernel, you could over-allocate so that the size is a multiple of 128.
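As a rough sketch of the over-allocation idea, the host-side helper below rounds a requested size up to the next multiple of a granularity. The name `padded_size` and the choice of 128 bytes as the granularity are my assumptions, based only on the kernel name; I haven't confirmed what unit the runtime actually uses.

```cpp
#include <cstddef>

// Hypothetical helper: round `bytes` up to the next multiple of `granularity`.
// The 128-byte default is a guess based on the memset128 kernel name.
constexpr std::size_t padded_size(std::size_t bytes, std::size_t granularity = 128) {
    return (bytes + granularity - 1) / granularity * granularity;
}

// Possible use with the CUDA runtime (not compiled here):
//
//   std::size_t padded = padded_size(n * sizeof(float));
//   float* d_buf = nullptr;
//   cudaMalloc(&d_buf, padded);     // allocate the padded size
//   cudaMemset(d_buf, 0, padded);   // clear the whole padded buffer
//   ...
//   cudaFree(d_buf);
```

Whether this actually removes the second kernel from the profile is something you'd have to check in the profiler; the cleanup kernel is cheap either way.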
Upvotes: 1