Reputation: 405
A few days ago I was comparing the performance of some code of mine, which performs a very simple replace, against the Thrust implementation of the same algorithm. I found a mismatch of one order of magnitude (!) in favor of Thrust, so I started stepping through their code with my debugger to find where the magic happens.
Surprisingly, I found that my very straightforward implementation was actually very similar to theirs, once I stripped away all the functor machinery and got to the nitty-gritty. I saw that Thrust has a clever way of deciding both block_size and grid_size (by the way: how exactly does that work?!), so since the kernels were so similar I just took their settings and ran my code again. I gained a few microseconds, but the situation stayed almost the same. Then, in the end, just to try it, I removed a cudaThreadSynchronize() after my kernel and BINGO! The gap disappeared (and then some): I gained a whole order of magnitude in execution time. Checking my array's values, I saw they were exactly what I expected, so the execution was correct.
The questions now are: when can I get rid of cudaThreadSynchronize() (and similar calls)? Why does it cause such a huge overhead? I see that Thrust itself doesn't synchronize at the end (synchronize_if_enabled(const char* message) is a NOP if the macro __THRUST_SYNCHRONOUS isn't defined, and it isn't). Details & code follow.
// my replace code
template <typename T>
__global__ void replaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    // grid-stride loop: each thread steps through the array by the total thread count
    const int gridSize = blockDim.x * gridDim.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    while(index < n)
    {
        if(dev[index] == oldval)
            dev[index] = newval;
        index += gridSize;
    }
}

// replace invocation - not in main because of cpp - cu separation
template <typename T>
void callReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    replaceSimple<<<30, 768, 0>>>(dev, n, oldval, newval);
    cudaThreadSynchronize();
}

// thrust replace invocation
template <typename T>
void callReplace(thrust::device_vector<T>& dev, const T oldval, const T newval)
{
    thrust::replace(dev.begin(), dev.end(), oldval, newval);
}
Parameter details: an array of n = 10,000,000 elements, all set to 2; oldval = 2, newval = 3.
I used CUDA 5.0 with the included Thrust; my card is a GeForce GTX 570, and I have a quad-core Q9550 @ 2.83 GHz with 2 GB of RAM.
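For completeness, here is a minimal sketch of the kind of host-side wall-clock timing involved (illustrative only, not my exact benchmark code; the use of std::chrono and the helper name timeReplaceSimple are just choices for this example):

#include <chrono>
#include <cstdio>

// Times one invocation of callReplaceSimple with a host wall clock.
// Whether the result is meaningful depends on callReplaceSimple
// synchronizing with the device before it returns.
template <typename T>
void timeReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    auto start = std::chrono::high_resolution_clock::now();
    callReplaceSimple(dev, n, oldval, newval);
    auto stop = std::chrono::high_resolution_clock::now();
    printf("replaceSimple: %f us\n",
           std::chrono::duration<double, std::micro>(stop - start).count());
}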
Upvotes: 0
Views: 2267
Reputation: 7245
Kernel launches are asynchronous. If you remove the cudaThreadSynchronize()
call, you only measure the kernel launch time, not the time until completion of the kernel.
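If you want to time the kernel to completion, either keep a synchronization call before reading the host clock, or use CUDA events. A rough sketch with events (not taken from your code; the launch configuration is copied from your callReplaceSimple):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
replaceSimple<<<30, 768, 0>>>(dev, n, oldval, newval);  // returns immediately, kernel runs asynchronously
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block the host until the kernel and the stop event have completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed device time between the two events

cudaEventDestroy(start);
cudaEventDestroy(stop);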
Upvotes: 6