Reputation: 405
A few days ago I was comparing the performance of some code of mine, which performs a very simple replace, against the Thrust implementation of the same algorithm. I found a mismatch of one order of magnitude (!) in favor of Thrust, so I started stepping through their code with my debugger to find where the magic happens.
Surprisingly, I found that my very straightforward implementation was actually very similar to theirs, once I stripped away all the functor machinery and got to the nitty-gritty. I saw that Thrust has a clever way of deciding both block_size and grid_size (by the way: how exactly does that work?!), so since the kernels were so similar I just took their settings and ran my code again. I gained a few microseconds, but the situation stayed almost the same. Then, in the end, just to try it, I removed a cudaThreadSynchronize() after my kernel and BINGO! The gap disappeared (and then some): I gained a whole order of magnitude in execution time. Checking my array's values, I saw they were exactly what I expected, so the execution was correct.
The questions now are: when can I get rid of cudaThreadSynchronize() (and similar calls)? Why does it cause such a huge overhead? I see that Thrust itself doesn't synchronize at the end (synchronize_if_enabled(const char* message) is a NOP if the macro __THRUST_SYNCHRONOUS isn't defined, and it isn't). Details & code follow.
// my replace code
template <typename T>
__global__ void replaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    // grid-stride loop: each thread steps through the array by the total thread count
    const int gridSize = blockDim.x * gridDim.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    while(index < n)
    {
        if(dev[index] == oldval)
            dev[index] = newval;
        index += gridSize;
    }
}

// replace invocation - not in main because of cpp - cu separation
template <typename T>
void callReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    replaceSimple<<<30, 768, 0>>>(dev, n, oldval, newval);
    cudaThreadSynchronize();
}

// thrust replace invocation
template <typename T>
void callReplace(thrust::device_vector<T>& dev, const T oldval, const T newval)
{
    thrust::replace(dev.begin(), dev.end(), oldval, newval);
}
Parameter details: an array of n = 10,000,000 elements, all set to 2; oldval = 2, newval = 3.
I used CUDA 5.0 with the included Thrust; my card is a GeForce GTX 570, and I have a quad-core Q9550 @ 2.83 GHz with 2 GB of RAM.
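For completeness, here is a minimal sketch of the kind of host-side wall-clock timing involved (illustrative only, not my exact benchmark code; the use of std::chrono and the helper name timeReplaceSimple are just choices for this example):

#include <chrono>
#include <cstdio>

// Times one invocation of callReplaceSimple with a host wall clock.
// Whether the result is meaningful depends on callReplaceSimple
// synchronizing with the device before it returns.
template <typename T>
void timeReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    auto start = std::chrono::high_resolution_clock::now();
    callReplaceSimple(dev, n, oldval, newval);
    auto stop = std::chrono::high_resolution_clock::now();
    printf("replaceSimple: %f us\n",
           std::chrono::duration<double, std::micro>(stop - start).count());
}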
Upvotes: 0
Views: 2267
Reputation: 7245
Kernel launches are asynchronous. If you remove the cudaThreadSynchronize()
call, you only measure the kernel launch time, not the time until completion of the kernel.
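If you want to time the kernel to completion, either keep a synchronization call before reading the host clock, or use CUDA events. A rough sketch with events (not taken from your code; the launch configuration is copied from your callReplaceSimple):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
replaceSimple<<<30, 768, 0>>>(dev, n, oldval, newval);  // returns immediately, kernel runs asynchronously
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block the host until the kernel and the stop event have completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed device time between the two events

cudaEventDestroy(start);
cudaEventDestroy(stop);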
Upvotes: 6