CUDA stream is slower than usual kernel

Question

I am trying to understand CUDA streams and have written my first program that uses them, but it is slower than the ordinary version without streams.

Why is this code slower

cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1);    
addKernel<<>>(pole_dev);
cudaMemcpyAsync(pole, pole_dev, size, cudaMemcpyDeviceToHost, stream_1);
cudaThreadSynchronize();  // I don't know the difference between cudaThreadSynchronize and cudaDeviceSynchronize
cudaDeviceSynchronize();  // they seem to behave about the same...

than:

cudaMemcpy(pole_dev, pole, size, cudaMemcpyHostToDevice);
addKernel<<>>(pole_dev);
cudaMemcpy(pole, pole_dev, size, cudaMemcpyDeviceToHost);

I thought the stream version should run faster. The value of the variable count is 6,500,000 (maximum). The first version takes 14 milliseconds and the second takes 11 milliseconds.

Can anybody explain it to me, please?

